Hi:

Unfortunately I hit the exact same bug today, with pvops kernel 2.6.32.36 and Xen 4.0.1. The kernel panic and serial log are attached.

Our test case is quite simple: on a single physical host we start 12 HVMs (Windows 2003), and each HVM reboots every 10 minutes. The bug is easy to hit on our 48G machine (within hours), but we haven't hit it on our 24G machines (we have three 24G machines; all work fine). Is it possibly related to memory capacity?

Taking a look at the serial output, the Dom0 code is attempting to pin what it thinks is a "PGT_l3_page_table", but the hypervisor returns -EINVAL because it actually is a "PGT_writable_page":

(XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 4000000000000000) for mfn 898a41 (pfn 9ca41)
(XEN) mm.c:2733:d0 Error while pinning mfn 898a41

And before that, quite a lot of abnormal grant table log lines like:

(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965888
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983

It looks like something is wrong with the grant table.

Many thanks.

> From: Jeremy Fitzhardinge <jeremy@goop.org>
> Subject: Re: [Xen-devel] [SPAM] Re: kernel BUG at
> arch/x86/xen/mmu.c:1860! - ideas.
> To: Ian Campbell <Ian.Campbell@citrix.com>
> Cc: Dave Hunter <dave@ivt.com.au>, Teck Choon Giam
> <giamteckchoon@gmail.com>, "xen-devel@lists.xensource.com"
> <xen-devel@lists.xensource.com>
>
> On 04/06/2011 12:53 AM, Ian Campbell wrote:
> > Please don't top post.
> >
> > On Wed, 2011-04-06 at 00:20 +0100, Dave Hunter wrote:
> >> Is it likely that Debian would release an updated kernel in squeeze with
> >> this configuration? (sorry, this might not be the place to ask).
> > I doubt they will; enabling DEBUG_PAGEALLOC seems very much like a
> > workaround, not a solution, to me.
>
> Yes, it will impose a pretty large performance overhead.
>
> J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
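[Editorial aside, not part of the original thread: one way to sanity-check the huge "Bad grant reference" values above, assuming grant references are 32-bit here, is to reinterpret them as signed integers — they sit just below 2^32, i.e. they look like small negative numbers printed as unsigned, hinting at invalid or uninitialized references from the frontend.]

```python
# Reinterpret the "Bad grant reference" values from the serial log as
# signed 32-bit integers (assumption: grant refs are 32-bit quantities).
def as_signed32(n):
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    return n - (1 << 32) if n >= (1 << 31) else n

for ref in (4294965983, 4294965888):
    print(ref, "->", as_signed32(ref))
# 4294965983 -> -1313, 4294965888 -> -1408: small negatives printed as unsigned.
```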
Hi:

As I go through the code against the log, I noticed that the line:

(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)

comes from xen/common/grant_table.c:266, which is in the function _set_status_v1(), so it looks like kernel 2.6.32 uses grant table version 1. Meanwhile, in 2.6.31's drivers/xen/grant-table.c I noticed the function gnttab_request_version(), which suggests 2.6.31 requests grant table version 2; but that function cannot be found in 2.6.32. Is this correct?

Thanks.

>--------------------------------------------------------------------------------
>From: tinnycloud@hotmail.com
>To: xen-devel@lists.xensource.com
>CC: dave@ivt.com.au; ian.campbell@citrix.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com; jeremy@goop.org
>Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1860!
>Date: Fri, 8 Apr 2011 19:24:35 +0800
>
> [full quote of the previous message trimmed]
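[Editorial aside, not part of the original thread: as background for the _set_status_v1() check discussed above, a v1 grant entry is just three fields, and an all-zero entry is unused (type GTF_invalid = 0) — which is what "Bad flags (0) or dom (0)" is complaining about. A toy decoder follows; the field layout matches Xen's public grant_table.h, but the sample bytes are made up.]

```python
import struct

# Decode a v1 grant entry: uint16 flags, uint16 domid, uint32 frame (8 bytes).
def decode_grant_v1(raw8):
    flags, domid, frame = struct.unpack("<HHI", raw8)
    gtf_type = flags & 0x3  # GTF_invalid=0, GTF_permit_access=1, GTF_accept_transfer=2
    return {"flags": flags, "domid": domid, "frame": frame,
            "valid": gtf_type != 0}

# An all-zero entry (what the hypervisor is apparently seeing) is invalid.
entry = decode_grant_v1(struct.pack("<HHI", 0, 0, 0))
print(entry["valid"])  # -> False
```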
Hi Konrad & Jeremy:

I'd like to open this BUG in a new thread, since the old thread has become too long to read easily.

We recently wanted to upgrade our kernel to 2.6.32, but unfortunately we ran into a kernel crash. Our test case is simple: start 24 Win2003 HVMs on a physical machine, with each HVM rebooting every 15 minutes. The kernel crashes within half an hour (that is, it crashes on the VMs' second start).

Our tests went further: we tried several kernel versions.

2.6.32.10 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=d945b014ac5df9592c478bf9486d97e8914aab59
2.6.32.11 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27f948a3bf365a5bc3d56119637a177d41147815
2.6.32.12 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=ba739f9abd3f659b907a824af1161926b420a2ce
2.6.32.13 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=f6fe6583b77a49b569eef1b66c3d761eec2e561b
2.6.32.15 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=27ed1b0e0dae5f1d5da5c76451bc84cb529128bd
2.6.32.21 http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=69e50db231723596ed8ef9275d0068d6697f466a

There are basically three different results we met:

i1) Grant table issue. The host still functions, but xm dmesg shows abnormal log lines; please refer to the attached grant table log.

i2) Kernel crash in a different place. The host dies during the test; after reboot, we can see nothing abnormal in /var/log/messages.

i3) kernel BUG at arch/x86/xen/mmu.c:1872. The host dies during the test; after reboot, we see the crash log in messages; refer to the attached log for 2.6.32.36.

Summary of the test results, in two classes:

1) 2.6.32.10: 30 machines were involved in the test; three hit issue (i1) and two hit issue (i2), with *no* issue (i3). The other machines have run the tests successfully until now, more than 8 hours.

2) 2.6.32.11 or later: each version covered 10 machines, and all machines crashed in less than half an hour.

Conclusion:
1) The grant table issue exists in all kernel versions.
2) Kernel crashes in other places may exist in all kernel versions, but do not happen frequently (2 out of 30).
3) The major difference we observe is issue (i3); from the tests, it looks like it was introduced between 2.6.32.10 and 2.6.32.11.

Hope this helps to locate the bug.

Many thanks.
Hi Konrad & Jeremy:

I think we finally located the missing patch for this.

We tested commit
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=c97f681f138039425c87f35ea46a92385d81e70e
which works.

We tested commit
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=221c64dbf860d37f841f40893bddf8d804aa55bd
on which the server crashed.

Later I found the comments for this commit:
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec

So it looks like this fix is not applied in 2.6.32.36. Could you take a look at this?

Many thanks.

> [full quote of the previous message trimmed]
Teck Choon Giam
2011-Apr-10 20:14 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
2011/4/10 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi Konrad & Jeremy:
>
> I think we finally located the missing patch for this commit.
> [...]
> So it looks like this fix is not applied on 2.6.32.36. Could you
> take a look at this?

Hi,

Sorry, this mmu-related BUG has troubled me for a very long time... I really want to "kill" this BUG, but my knowledge of kernel hacking and/or Xen is very limited.

While waiting for Jeremy or Konrad or others...

Many thanks for spending time tracking down this mmu-related BUG. I have backported the commit from
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec
to the 2.6.32.36 pvops kernel; the patch is attached. I don't know whether I backported it correctly, nor whether it affects anything else. I am currently testing the 2.6.32.36 pvops kernel with this patch applied and CONFIG_DEBUG_PAGEALLOC unset, running testcrash.sh with a loop of 1000, as I was unable to reproduce this mmu BUG at line 1872 with a loop of 100. Please note that with CONFIG_DEBUG_PAGEALLOC unset, I can reproduce this mmu BUG easily within <50 testcrash.sh loop cycles on pvops kernels 2.6.32.24 to 2.6.32.36. Now testing with this backport patch to see whether I can still reproduce it...

Kindest regards,
Giam Teck Choon
Teck Choon Giam
2011-Apr-11 12:16 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
> Now test with this backport patch to see whether I can
> reproduce this mmu BUG... ...
>
> Kindest regards,
> Giam Teck Choon

I have tested with my backport patch, and it is working fine: I am unable to reproduce the mmu.c:1872 or mmu.c:1860 bug with CONFIG_DEBUG_PAGEALLOC not set. I tested with testcrash.sh loops of 100 and 1000, and am now doing a loop of 10000.

Xiaoyun, is it possible for you to test my patch and see whether you can reproduce the mmu.c 1872/1860 bug?

Can anyone review my patch? I will post a properly formatted patch, according to Documentation/SubmittingPatches, in my next reply, which hopefully can be reviewed.

Thanks.

Kindest regards,
Giam Teck Choon
Teck Choon Giam
2011-Apr-11 12:22 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
From: Giam Teck Choon <giamteckchoon@gmail.com>

vmalloc: eagerly clear ptes on vunmap

Backport from commit 64141da587241301ce8638cc945f8b67853156ec to 2.6.32.36
URL: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec

Without this patch, kernel BUG at arch/x86/xen/mmu.c:1860 or kernel BUG
at arch/x86/xen/mmu.c:1872 is easily triggered when
CONFIG_DEBUG_PAGEALLOC is unset, especially when doing LVM snapshots.

Signed-off-by: Giam Teck Choon <giamteckchoon@gmail.com>
---
 arch/x86/xen/mmu.c      |    2 --
 include/linux/vmalloc.h |    2 --
 mm/vmalloc.c            |   28 +++++++++++++++++-----------
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fa36ab8..204e3ba 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2430,8 +2430,6 @@ void __init xen_init_mmu_ops(void)
 	x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
 	x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
 	pv_mmu_ops = xen_mmu_ops;
-
-	vmap_lazy_unmap = false;
 }
 
 /* Protected by xen_reservation_lock. */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1a2ba21..3c123c3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -7,8 +7,6 @@
 
 struct vm_area_struct;		/* vma defining user mapping in mm_types.h */
 
-extern bool vmap_lazy_unmap;
-
 /* bits in flags of vmalloc's vm_struct below */
 #define VM_IOREMAP	0x00000001	/* ioremap() and friends */
 #define VM_ALLOC	0x00000002	/* vmalloc() */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f701c2..80cbd7b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -31,8 +31,6 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
-bool vmap_lazy_unmap __read_mostly = true;
-
 /*** Page table manipulation functions ***/
 
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
@@ -503,9 +501,6 @@ static unsigned long lazy_max_pages(void)
 {
 	unsigned int log;
 
-	if (!vmap_lazy_unmap)
-		return 0;
-
 	log = fls(num_online_cpus());
 
 	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
@@ -566,7 +561,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 			if (va->va_end > *end)
 				*end = va->va_end;
 			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
-			unmap_vmap_area(va);
 			list_add_tail(&va->purge_list, &valist);
 			va->flags |= VM_LAZY_FREEING;
 			va->flags &= ~VM_LAZY_FREE;
@@ -612,10 +606,11 @@ static void purge_vmap_area_lazy(void)
 }
 
 /*
- * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
- * called for the correct range previously.
+ * Free a vmap area, caller ensuring that the area has been unmapped
+ * and flush_cache_vunmap had been called for the correct range
+ * previously.
 */
-static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+static void free_vmap_area_noflush(struct vmap_area *va)
 {
 	va->flags |= VM_LAZY_FREE;
 	atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
@@ -624,6 +619,16 @@ static void free_unmap_vmap_area_noflush(struct vmap_area *va)
 }
 
 /*
+ * Free and unmap a vmap area, caller ensuring flush_cache_vunmap had been
+ * called for the correct range previously.
+ */
+static void free_unmap_vmap_area_noflush(struct vmap_area *va)
+{
+	unmap_vmap_area(va);
+	free_vmap_area_noflush(va);
+}
+
+/*
 * Free and unmap a vmap area
 */
 static void free_unmap_vmap_area(struct vmap_area *va)
@@ -799,7 +804,7 @@ static void free_vmap_block(struct vmap_block *vb)
 	spin_unlock(&vmap_block_tree_lock);
 	BUG_ON(tmp != vb);
 
-	free_unmap_vmap_area_noflush(vb->va);
+	free_vmap_area_noflush(vb->va);
 	call_rcu(&vb->rcu_head, rcu_free_vb);
 }
 
@@ -936,6 +941,8 @@ static void vb_free(const void *addr, unsigned long size)
 	rcu_read_unlock();
 	BUG_ON(!vb);
 
+	vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);
+
 	spin_lock(&vb->lock);
 	BUG_ON(bitmap_allocate_region(vb->dirty_map, offset >> PAGE_SHIFT, order));
@@ -988,7 +995,6 @@ void vm_unmap_aliases(void)
 				s = vb->va->va_start + (i << PAGE_SHIFT);
 				e = vb->va->va_start + (j << PAGE_SHIFT);
-				vunmap_page_range(s, e);
 				flush = 1;
 
 				if (s < start)
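[Editorial aside, not part of the original thread: a rough illustration of what this patch changes, as a toy Python model under my own assumptions — not kernel code. With lazy unmapping, a vunmapped page can keep a live writable PTE alias until some later purge_vmap_area_lazy(), and Xen refuses to pin such a page as a pagetable ("Bad type" / -EINVAL); clearing the PTE eagerly at vunmap time removes the stale alias.]

```python
# Toy model of lazy vs eager vmap PTE clearing and Xen pagetable pinning.
class VmapModel:
    def __init__(self, lazy):
        self.lazy = lazy
        self.writable_aliases = set()  # pages with a live vmap PTE

    def vmap(self, page):
        self.writable_aliases.add(page)

    def vunmap(self, page):
        if not self.lazy:  # the patch: clear the PTE eagerly on vunmap
            self.writable_aliases.discard(page)
        # lazy mode: the PTE stays live until some later bulk purge

    def pin_as_pagetable(self, page):
        # Xen refuses to pin a page that still has a writable mapping.
        return page not in self.writable_aliases

buggy, fixed = VmapModel(lazy=True), VmapModel(lazy=False)
for m in (buggy, fixed):
    m.vmap("mfn 898a41")
    m.vunmap("mfn 898a41")
print(buggy.pin_as_pagetable("mfn 898a41"))  # False: stale alias, pin fails
print(fixed.pin_as_pagetable("mfn 898a41"))  # True: eagerly cleared, pin succeeds
```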
Hi:

I believe this is the fix, to a large extent. With this patch, my own test cases succeed over 30 rounds, where every round takes 8 hours; without the patch, the tests failed every round within 15 minutes. So this really fixes most of the problem.

But during the runs I met another crash; from the log, it looks related to this BUG, since the crash log shows it is TLB-related and this BUG is also TLB-related.

Well, I also have poor knowledge of the kernel. I hope someone from xen-devel can offer some help.

Many thanks.

> Date: Mon, 11 Apr 2011 20:16:53 +0800
> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org; keir@xen.org
>
> [...]
> Xiaoyun, is it possible for you to test my patch and see whether can
> you reproduce the mmu.c 1872/1860 bug?
> [...]
Teck Choon Giam
2011-Apr-11 15:25 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
2011/4/11 MaoXiaoyun <tinnycloud@hotmail.com>:
> I believe this is the fix, to a large extent.
> [...]
> But during the runs I met another crash; from the log, it looks related
> to this BUG, since the crash log shows it is TLB-related and this BUG is
> also TLB-related.

Are you able to run another test with cpuidle=0 cpufreq=none in the boot options? I am just curious whether you can still reproduce the TLB bug when you boot with cpuidle=0 cpufreq=none...

> Well, I also have poor knowledge of the kernel.
> Hope someone from Xen-devel can offer some help.
> [...]
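[Editorial aside, not part of the original thread: such a test boot could look like the sketch below, a GRUB 0.97 menu.lst fragment in which the paths, versions, and root device are assumptions on my part. Note that cpuidle= and cpufreq= are Xen hypervisor command-line options, so they belong on the xen.gz line rather than on the dom0 kernel line.]

```
title Xen 4.0.1 / pvops 2.6.32.36 (cpuidle/cpufreq disabled for testing)
    root (hd0,0)
    kernel /boot/xen-4.0.1.gz cpuidle=0 cpufreq=none
    module /boot/vmlinuz-2.6.32.36 ro root=/dev/sda1 console=tty0
    module /boot/initrd-2.6.32.36.img
```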
Jeremy Fitzhardinge
2011-Apr-11 18:08 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
> Hi:
>
> I believe this is the fix, to a large extent.
> [...]
> But during the runs I met another crash; from the log, it looks related
> to this BUG, since the crash log shows it is TLB-related and this BUG is
> also TLB-related.

Thanks for confirming; it makes sense and explains the symptoms, so I'm glad it also works ;)

J
Hi: I have just kicked off cpuidle=0 "cpufreq=none" tests. What is your Xen version? Do you use the backend driver of 2.6.32.36? Beside the "TLB BUG ", I''ve met at least two other issues 1)Xen4.0.1 + 2.6.32.36 kernel + backend driver from 2.6.31 ==> will cause "Bad grant reference " log in serial output 2)Xen4.0.1 + 2.6.32.36 kernel with its owen backend driver ==> will cause disk error like belows. sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device sd 0:0:0:0: rejecting I/O to offline device end_request: I/O error, dev tdb, sector 28699593 end_request: I/O error, dev tdb, sector 28699673 end_request: I/O error, dev tdb, sector 28699753 end_request: I/O error, dev tdb, sector 28699833 end_request: I/O error, dev tdb, sector 28699913 end_request: I/O error, dev tdb, sector 28699993 end_request: I/O error, dev tdb, sector 28700073 thanks.> Date: Mon, 11 Apr 2011 23:25:19 +0800 > Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872 > From: giamteckchoon@gmail.com > To: tinnycloud@hotmail.com > CC: xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org; keir@xen.org > > 2011/4/11 MaoXiaoyun <tinnycloud@hotmail.com>: > > Hi: > > > > I believe this is the fix at much extent. > > Since I have my own test cases which with this patch, my test case will > > success in 30 rounds run. > > Every round takes 8hours. While without this patch, tests fail evey > > round in 15minutes. > > > > So this really means fix most of the things. 
> > > > But during running, I met another crash, from the log it it looks like > > has relation with > > this BUG, since the crash log shows it is tlb related and this BUG also tlb > > related. > > Are you able to run another test with cpuidle=0 cpufreq=none in kernel > boot option? Just curious whether can you reproduce the tlb bug when > you boot with cpuidle=0 cpufreq=none... ... > > > > > Well, I''m also have poor knowledge of kernel. > > Hope someone from Xen Devel offer some help. > > > > Many thanks. > > > >> Date: Mon, 11 Apr 2011 20:16:53 +0800 > >> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872 > >> From: giamteckchoon@gmail.com > >> To: tinnycloud@hotmail.com > >> CC: xen-devel@lists.xensource.com; dave@ivt.com.au; > >> ian.campbell@citrix.com; konrad.wilk@oracle.com; jeremy@goop.org; > >> keir@xen.org > >> > >> > > >> > Hi, > >> > > >> > Sorry, since this mmu related BUG has been troubled me for very > >> > long... I really want to "kill" this BUG but my knowledge in kernel > >> > hacking and/or xen is very limited. > >> > > >> > While waiting for Jeremy or Konrad or others ... > >> > > >> > Many thanks for spending time to track down this mmu related BUG. I > >> > have backported the commit from > >> > > >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=64141da587241301ce8638cc945f8b67853156ec > >> > to 2.6.32.36 PVOPS kernel and patch attached. I won''t know whether > >> > did I backport it correctly nor does it affects anything. I am > >> > currently testing the 2.6.32.36 PVOPS kernel with this patch applied > >> > and also unset CONFIG_DEBUG_PAGEALLOC. Currently running testcrash.sh > >> > loop 1000 as I am unable to reproduce this mmu BUG 1872 in > >> > testcrash.sh loop 100. Please note that when CONFIG_DEBUG_PAGEALLOC > >> > is unset, I can reproduce this mmu BUG 1872 easily within <50 > >> > testcrash.sh loop cycle with PVOPS version 2.6.32.24 to 2.6.32.36 > >> > kernel. 
> >> > Now testing with this backport patch to see whether I can
> >> > reproduce this mmu BUG... ...
> >> >
> >> > Kindest regards,
> >> > Giam Teck Choon
> >> >
> >>
> >> I have tested with my backport patch and it is working fine, as I am
> >> unable to reproduce the mmu.c 1872 or 1860 bug with
> >> CONFIG_DEBUG_PAGEALLOC not set. I tested with testcrash.sh loop 100
> >> and 1000. Now doing testcrash.sh loop 10000.
> >>
> >> Xiaoyun, is it possible for you to test my patch and see whether you can
> >> reproduce the mmu.c 1872/1860 bug?
> >>
> >> Can anyone of you review my patch?
> >>
> >> I will post a formatted patch according to
> >> Documentation/SubmittingPatches in my next reply, which hopefully can be
> >> reviewed.
> >>
> >> Thanks.
> >>
> >> Kindest regards,
> >> Giam Teck Choon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Thanks for your reply and confirmation.

Well, what's your opinion of the TLB bug? Is it related to this patch, or is it a new bug?
Attached is the new log I got from tests on 28 machines; one crashed.

> Date: Mon, 11 Apr 2011 11:08:10 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; dave@ivt.com.au; ian.campbell@citrix.com; konrad.wilk@oracle.com; keir@xen.org
> Subject: Re: kernel BUG at arch/x86/xen/mmu.c:1872
>
> On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > I believe this is the fix to a large extent.
> > Since I have my own test case: with this patch, my test case
> > will succeed in a 30-round run.
> > Every round takes 8 hours, while without this patch, tests failed every
> > round within 15 minutes.
> >
> > So this really means it fixes most of the things.
> >
> > But during running, I met another crash; from the log it looks like
> > it is related to this BUG, since the crash log shows it is TLB related and this BUG
> > is also TLB related.
> >
> > Well, I also have poor knowledge of the kernel.
> > Hope someone from Xen Devel can offer some help.
>
> Thanks for confirming; it makes sense and explains the symptoms, so I'm
> glad it also works ;)
>
> J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Hi:

We are just about to try the new kernel, but have hit errors on the grant table.

2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183

Our test is simple: 24 HVMs (Win2003) on a single host; each HVM loops, restarting every 15 minutes.
Please refer to the error log from serial output.

I've traced the log a bit; the log is from xen/common/grant_table.c.

1) The log "grant_table.c:1717:d0 Bad grant reference 4294965983" is from:

1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
1717                  "Bad grant reference %ld\n", gref);
1718         BUG();
1719     }

2) The log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)" is from:

grant_table.c:1967 => __acquire_grant_for_copy => _set_status

(not from __gnttab_map_grant_ref, since I added some log to identify this)

The log shows that all are from gnttab_copy, which I later found only netback uses for its grant copy hypercall.

I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but still met these errors. So it looks like it is kernel related.

What could cause this, and will it be harmful to the HVM guests?

Many thanks.

======================================
(XEN) Xen trace buffers: disabled
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
(XEN) Freed 168kB init memory.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 17 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 13 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 11 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 11 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 6 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 29 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 25 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 25 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 19 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 27 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 27 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 10 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 15 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 8 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 9 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 7 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 5 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 2 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:1717:d0 Bad grant reference 4294965983
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 3 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) printk: 1 messages suppressed.
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)
(XEN) grant_table.c:578:d0 Iomem mapping not permitted ffffffffffffffff (domain 137)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 08:46 UTC
Re: [Xen-devel] Grant Table Error on 2.6.32.36 + Xen 4.0.1
On Tue, Apr 12, 2011 at 02:48:36PM +0800, MaoXiaoyun wrote:
> Hi:
>
> We are just about to try the new kernel, but have hit errors on the grant table.

Please open a new thread on this one. This is getting confusing.

> 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
>
> Our test is simple: 24 HVMs (Win2003) on a single host; each HVM loops, restarting every 15 minutes.
> Please refer to the error log from serial output.
>
> I've traced the log a bit; the log is from xen/common/grant_table.c.
>
> 1) The log "grant_table.c:1717:d0 Bad grant reference 4294965983" is from:
>
> 1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
> 1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
> 1717                  "Bad grant reference %ld\n", gref);
> 1718         BUG();
> 1719     }
>
> 2) The log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)" is from:
>
> grant_table.c:1967 => __acquire_grant_for_copy => _set_status
>
> (not from __gnttab_map_grant_ref, since I added some log to identify this)
>
> The log shows that all are from gnttab_copy, which I later found only netback
> uses for its grant copy hypercall.
>
> I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but
> still met these errors. So it looks like it is kernel related.
>
> What could cause this, and will it be harmful to the HVM guests?

What is the storage for your HVM guests? iSCSI?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
MaoXiaoyun
2011-Apr-12 09:02 UTC
RE: [Xen-devel] Grant Table Error on 2.6.32.36 + Xen 4.0.1
Thanks Konrad. I will start a new thread on the TLB bug.

For the grant table error, I added some debug log in netback.c, line 388:

358 static u16 netbk_gop_frag(struct xen_netif *netif, struct netbk_rx_meta *meta,
359                           int i, struct netrx_pending_operations *npo,
360                           struct page *page, unsigned long size,
361                           unsigned long offset)
362 {
363         struct gnttab_copy *copy_gop;
364         struct xen_netif_rx_request *req;
365         unsigned long old_mfn;
366         int idx = netif_page_index(page);
367
368         old_mfn = virt_to_mfn(page_address(page));
369
370         req = RING_GET_REQUEST(&netif->rx, netif->rx.req_cons + i);
371
372         copy_gop = npo->copy + npo->copy_prod++;
373         copy_gop->flags = GNTCOPY_dest_gref;
374         if (idx > -1) {
375                 struct pending_tx_info *src_pend = &pending_tx_info[idx];
376                 copy_gop->source.domid = src_pend->netif->domid;
377                 copy_gop->source.u.ref = src_pend->req.gref;
378                 copy_gop->flags |= GNTCOPY_source_gref;
379         } else {
380                 copy_gop->source.domid = DOMID_SELF;
381                 copy_gop->source.u.gmfn = old_mfn;
382         }
383         copy_gop->source.offset = offset;
384         copy_gop->dest.domid = netif->domid;
385         copy_gop->dest.offset = 0;
386         copy_gop->dest.u.ref = req->gref;
387         copy_gop->len = size;
388         if (req->gref > 16384)
389                 IPRINTK("dom %d, req gref %d size = %lu\n", netif->domid, req->gref, size);
390
391         return req->id;
392 }

And the output below indicates something might be wrong with the grant table.
Apr 12 16:38:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 270
Apr 12 16:38:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:38:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 270
Apr 12 16:38:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:40 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:40 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:42 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:42 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:44 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:44 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:57 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:57 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:38:59 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:22 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:22 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:26 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:26 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:29 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 14, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:29 xmao kernel: xen_net: dom 14, req gref 5242956 size = 42
Apr 12 16:39:30 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:30 xmao kernel: xen_net: dom 14, req gref 1817341261 size = 42
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:31 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 38
Apr 12 16:39:31 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:32 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 38
Apr 12 16:39:32 xmao kernel: xen_net: dom 14, req gref -1408 size = 72
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref -1408 size = 42
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:33 xmao kernel: xen_net: dom 14, req gref 1850305869 size = 38
Apr 12 16:39:33 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 38
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 72
Apr 12 16:39:34 xmao kernel: xen_net: dom 23, req gref -1313 size = 42
Apr 12 16:39:34 xmao kernel: xen_net: dom 14, req gref -1313 size = 42
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 270
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 38
Apr 12 16:39:35 xmao kernel: xen_net: dom 23, req gref -1313 size = 72

> Date: Tue, 12 Apr 2011 04:46:29 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; tim.deegan@citrix.com; george.dunlap@eu.citrix.com; giamteckchoon@gmail.com; ian.campbell@citrix.com; keir.fraser@eu.citrix.com
> Subject: Re: [Xen-devel] Grant Table Error on 2.6.32.36 + Xen 4.0.1
>
> On Tue, Apr 12, 2011 at 02:48:36PM +0800, MaoXiaoyun wrote:
> >
> > Hi:
> >
> > We are just about to try the new kernel, but have hit errors on the grant table.
>
> Please open a new thread on this one. This is getting confusing.
>
> > 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> >
> > Our test is simple: 24 HVMs (Win2003) on a single host; each HVM loops, restarting every 15 minutes.
> > Please refer to the error log from serial output.
> >
> > I've traced the log a bit; the log is from xen/common/grant_table.c.
> >
> > 1) The log "grant_table.c:1717:d0 Bad grant reference 4294965983" is from:
> >
> > 1715     if ( unlikely(gref >= nr_grant_entries(rd->grant_table)) ){
> > 1716         PIN_FAIL(unlock_out, GNTST_bad_gntref,
> > 1717                  "Bad grant reference %ld\n", gref);
> > 1718         BUG();
> > 1719     }
> >
> > 2) The log "grant_table.c:266:d0 Bad flags (0) or dom (0). (expected dom 0)" is from:
> >
> > grant_table.c:1967 => __acquire_grant_for_copy => _set_status
> >
> > (not from __gnttab_map_grant_ref, since I added some log to identify this)
> >
> > The log shows that all are from gnttab_copy, which I later found only netback
> > uses for its grant copy hypercall.
> >
> > I also tried the netback code from 2.6.31 (which works well with kernel 2.6.31), but
> > still met these errors. So it looks like it is kernel related.
> >
> > What could cause this, and will it be harmful to the HVM guests?
>
> What is the storage for your HVM guests? iSCSI?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Hi : We are using pvops kernel 2.6.32.36 + xen 4.0.1, but confront a kernel panic bug. 2.6.32.36 Kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4 Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183 Our test is simple, 24 HVMS(Win2003 ) on a single host, each HVM loopes in restart every 15minutes. About 17 machines are invovled in the test, after 10 hours run, one confrontted a crash at arch/x86/mm/tlb.c:61 Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck''s suggestion. Any comments, thanks. ===============crash log=========================INIT: Id "s0" respawning too fast: disabled for 5 minutes __ratelimit: 14 callbacks suppressed blktap_sysfs_destroy blktap_sysfs_destroy ------------[ cut here ]------------ kernel BUG at arch/x86/mm/tlb.c:61! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb CPU 1 Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table] Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285 RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46 RSP: e02b:ffff88002805be48 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0 RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200 R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880 R13: ffff880028063020 R14: 
00000000000004f6 R15: 0000000000000000
FS:  00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40)
Stack:
 ffff88002805be68 ffffffff8100e4ae 0000000000000001 ffff88009d733b88
<0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78
<0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108
Call Trace:
 <IRQ>
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002805be48>
---[ end trace ce9cee6832a9c503 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25581, comm: khelper Tainted: G      D    2.6.32.36fixxen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144008a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8e2>] ? check_events+0x12/0x20
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448185>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a5c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114362>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-12 10:00 UTC
[Xen-devel] Re: Kernel BUG at arch/x86/mm/tlb.c:61
On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
>
> Hi:
>
> We are using pvops kernel 2.6.32.36 + Xen 4.0.1, and have hit a kernel panic.
>
> 2.6.32.36 kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
>
> Our test is simple: 24 HVMs (Win2003) on a single host, each HVM looping through a restart every 15 minutes.

What is the storage that you are using for your guests? AoE? Local disks?

> About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
>
> Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
>
> Any comments, thanks.
>
> [quoted crash log snipped; identical to the log above]
VHD files on local disk:

disk = [ 'tap:vhd:/mnt/xmao/test/img/win2003.cp1.vhd,hda,w' ]

Thanks.

> Date: Tue, 12 Apr 2011 06:00:00 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; jeremy@goop.org
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>
> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> >
> > Hi:
> >
> > We are using pvops kernel 2.6.32.36 + Xen 4.0.1, and have hit a kernel panic.
> >
> > 2.6.32.36 kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> >
> > Our test is simple: 24 HVMs (Win2003) on a single host, each HVM looping through a restart every 15 minutes.
>
> What is the storage that you are using for your guests? AoE? Local disks?
>
> > About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
> >
> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
> >
> > Any comments, thanks.
> >
> > [quoted crash log snipped; identical to the log above]
Teck Choon Giam
2011-Apr-12 16:08 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
If possible, please try not to top-post, as it makes the thread harder to follow, for me at least. Thanks ;)

2011/4/12 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
> I have just kicked off cpuidle=0 "cpufreq=none" tests.

Let's see whether you are able to reproduce the tlb BUG with the above.

> What is your Xen version? Do you use the backend driver of 2.6.32.36?

You are asking me? xen-4.0.2-rc3-pre latest changeset and also xen-4.1.1-rc1-pre. What do you mean by backend driver? My testing is mostly on PV domU and HVM on Windows with LVM as storage. I do not use VHD or any PV drivers for Windows.

> Besides the "TLB BUG", I've met at least two other issues:
> 1) Xen 4.0.1 + 2.6.32.36 kernel + backend driver from 2.6.31 ==> causes "Bad grant reference" logs in serial output
> 2) Xen 4.0.1 + 2.6.32.36 kernel with its own backend driver ==> causes disk errors like the following:
>
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> end_request: I/O error, dev tdb, sector 28699593
> end_request: I/O error, dev tdb, sector 28699673
> end_request: I/O error, dev tdb, sector 28699753
> end_request: I/O error, dev tdb, sector 28699833
> end_request: I/O error, dev tdb, sector 28699913
> end_request: I/O error, dev tdb, sector 28699993
> end_request: I/O error, dev tdb, sector 28700073

Is this related to VHD? What is the specific backend driver? Did these start to surface after you applied my backport patch, or were they already there regardless of the patch?

Thanks.

Kindest regards,
Giam Teck Choon
Teck Choon Giam
2011-Apr-12 16:32 UTC
[Xen-devel] Re: kernel BUG at arch/x86/xen/mmu.c:1872
2011/4/12 Jeremy Fitzhardinge <jeremy@goop.org>:
> On 04/11/2011 05:31 AM, MaoXiaoyun wrote:
>> Hi:
>>
>> I believe this is the fix, to a large extent. With this patch, my own
>> test case succeeds over 30 rounds, each round taking 8 hours; without
>> the patch, the test fails every round within 15 minutes.
>>
>> So this really fixes most of it.
>>
>> But during the run I met another crash. From the log it looks related
>> to this BUG, since the crash log shows it is tlb-related and this BUG
>> is also tlb-related.
>>
>> Well, I also have poor knowledge of the kernel.
>> Hope someone from xen-devel can offer some help.
>
> Thanks for confirming; it makes sense and explains the symptoms, so I'm
> glad it also works ;)
>
> J

Thanks Jeremy, I can see the needed backport patch is in your xen/next-2.6.32 tree now ;)

Kindest regards,
Giam Teck Choon
Hi:

I've run the tests with "cpuidle=0 cpufreq=none", and two machines crashed.

blktap_sysfs_destroy
blktap_sysfs_destroy
blktap_sysfs_create: adding attributes for dev ffff8800ad581000
blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00
------------[ cut here ]------------
kernel BUG at arch/x86/mm/tlb.c:61!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/block/tapdeve/dev
CPU 0
Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285
RIP: e030:[<ffffffff8103a3cb>]  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
RSP: e02b:ffff88002803ee48  EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980
RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200
R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80
R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000
FS:  00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000)
Stack:
 ffff88002803ee68 ffffffff8100e4a4 0000000000000001 ffff880097de3b88
<0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78
<0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8
Call Trace:
 <IRQ>
 [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8
RIP  [<ffffffff8103a3cb>] leave_mm+0x15/0x46
 RSP <ffff88002803ee48>
---[ end trace 1522f17fdfc9162d ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 8022, comm: khelper Tainted: G      D    2.6.32.36xen #1
Call Trace:
 <IRQ>  [<ffffffff8105682e>] panic+0xe0/0x19a
 [<ffffffff8144006a>] ? init_amd+0x296/0x37a
 [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff8100f8c2>] ? check_events+0x12/0x20
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81448165>] oops_end+0xb6/0xc6
 [<ffffffff810166e5>] die+0x5a/0x63
 [<ffffffff81447a3c>] do_trap+0x115/0x124
 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11
 [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67
 [<ffffffff81013b3b>] invalid_op+0x1b/0x20
 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46
 [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
 [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
 [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
 [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81150dc9>] ?
load_elf_binary+0x0/0x17ef
 [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef
 [<ffffffff81042fcf>] ? need_resched+0x23/0x2d
 [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7
 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef
 [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255
 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e
 [<ffffffff8101155d>] ? sys_execve+0x43/0x5d
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e
 [<ffffffff81013daa>] ? child_rip+0xa/0x20
 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f
 [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81013da0>] ? child_rip+0x0/0x20
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

> Date: Tue, 12 Apr 2011 06:00:00 -0400
> From: konrad.wilk@oracle.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; jeremy@goop.org
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>
> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote:
> >
> > Hi:
> >
> > We are using pvops kernel 2.6.32.36 + Xen 4.0.1, and have hit a kernel panic.
> >
> > 2.6.32.36 kernel: http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4
> > Xen 4.0.1: http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183
> >
> > Our test is simple: 24 HVMs (Win2003) on a single host, each HVM looping through a restart every 15 minutes.
>
> What is the storage that you are using for your guests? AoE? Local disks?
>
> > About 17 machines are involved in the test; after a 10-hour run, one hit a crash at arch/x86/mm/tlb.c:61.
> >
> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck's suggestion.
> >
> > Any comments, thanks.
> > > > ===============crash log=========================> > INIT: Id "s0" respawning too fast: disabled for 5 minutes > > __ratelimit: 14 callbacks suppressed > > blktap_sysfs_destroy > > blktap_sysfs_destroy > > ------------[ cut here ]------------ > > kernel BUG at arch/x86/mm/tlb.c:61! > > invalid opcode: 0000 [#1] SMP > > last sysfs file: /sys/devices/system/xen_memory/xen_memory0/info/current_kb > > CPU 1 > > Modules linked in: 8021q garp xen_netback xen_blkback blktap blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device serio_raw bnx2 snd_pcm_oss snd_mixer_oss snd_pcm snd_timer iTCO_wdt snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support i2c_core pcspkr pata_acpi ata_generic ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table] > > Pid: 25581, comm: khelper Not tainted 2.6.32.36fixxen #1 Tecal RH2285 > > RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46 > > RSP: e02b:ffff88002805be48 EFLAGS: 00010046 > > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88015f8e2da0 > > RDX: ffff88002805be78 RSI: 0000000000000000 RDI: 0000000000000001 > > RBP: ffff88002805be48 R08: ffff88009d662000 R09: dead000000200200 > > R10: dead000000100100 R11: ffffffff814472b2 R12: ffff88009bfc1880 > > R13: ffff880028063020 R14: 00000000000004f6 R15: 0000000000000000 > > FS: 00007f62362d66e0(0000) GS:ffff880028058000(0000) knlGS:0000000000000000 > > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: 0000003aabc11909 CR3: 000000009b8ca000 CR4: 0000000000002660 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process khelper (pid: 25581, threadinfo ffff88007691e000, task ffff88009b92db40) > > Stack: > > ffff88002805be68 ffffffff8100e4ae 
0000000000000001 ffff88009d733b88 > > <0> ffff88002805be98 ffffffff81087224 ffff88002805be78 ffff88002805be78 > > <0> ffff88015f808360 00000000000004f6 ffff88002805bea8 ffffffff81010108 > > Call Trace: > > <IRQ> > > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53 > > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc > > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28 > > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 > > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e > > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d > > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46 > > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 > > <EOI> > > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17 > > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500 > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef > > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d > > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7 > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255 > > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e > > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e > > [<ffffffff81013daa>] ? child_rip+0xa/0x20 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b > > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 > > [<ffffffff81013da0>] ? 
child_rip+0x0/0x20 > > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 > > RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46 > > RSP <ffff88002805be48> > > ---[ end trace ce9cee6832a9c503 ]--- > > Kernel panic - not syncing: Fatal exception in interrupt > > Pid: 25581, comm: khelper Tainted: G D 2.6.32.36fixxen #1 > > Call Trace: > > <IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a > > [<ffffffff8144008a>] ? init_amd+0x296/0x37a > > [<ffffffff8100f17d>] ? xen_force_evtchn_callback+0xd/0xf > > [<ffffffff8100f8e2>] ? check_events+0x12/0x20 > > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25 > > [<ffffffff81448185>] oops_end+0xb6/0xc6 > > [<ffffffff810166e5>] die+0x5a/0x63 > > [<ffffffff81447a5c>] do_trap+0x115/0x124 > > [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5 > > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46 > > [<ffffffff8100f6fa>] ? xen_clocksource_read+0x21/0x23 > > [<ffffffff8100f26c>] ? HYPERVISOR_vcpu_op+0xf/0x11 > > [<ffffffff8100f767>] ? xen_vcpuop_set_next_event+0x52/0x67 > > [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81 > > [<ffffffff81013b3b>] invalid_op+0x1b/0x20 > > [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17 > > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46 > > [<ffffffff8100e4ae>] drop_other_mm_ref+0x2a/0x53 > > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc > > [<ffffffff81010108>] xen_call_function_single_interrupt+0x13/0x28 > > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 > > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e > > [<ffffffff8128c1c0>] __xen_evtchn_do_upcall+0x1ab/0x27d > > [<ffffffff8128dd11>] xen_evtchn_do_upcall+0x33/0x46 > > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 > > <EOI> [<ffffffff814472b2>] ? _spin_unlock_irqrestore+0x15/0x17 > > [<ffffffff8100f8cf>] ? 
xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff81113f71>] ? flush_old_exec+0x3ac/0x500 > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff8115115d>] ? load_elf_binary+0x398/0x17ef > > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d > > [<ffffffff811f4648>] ? process_measurement+0xc0/0xd7 > > [<ffffffff81150dc5>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81113094>] ? search_binary_handler+0xc8/0x255 > > [<ffffffff81114362>] ? do_execve+0x1c3/0x29e > > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff8100f8cf>] ? xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e > > [<ffffffff81013daa>] ? child_rip+0xa/0x20 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b > > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 > > [<ffffffff81013da0>] ? child_rip+0x0/0x20 > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
2011/4/14 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
> I've done the test with "cpuidle=0 cpufreq=none"; two machines crashed.
>
> blktap_sysfs_destroy > blktap_sysfs_destroy > blktap_sysfs_create: adding attributes for dev ffff8800ad581000 > blktap_sysfs_create: adding attributes for dev ffff8800a48e3e00 > ------------[ cut here ]------------ > kernel BUG at arch/x86/mm/tlb.c:61! > invalid opcode: 0000 [#1] SMP > last sysfs file: /sys/block/tapdeve/dev > CPU 0 > Modules linked in: 8021q garp blktap xen_netback xen_blkback blkback_pagemap nbd bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc bonding ipv6 xenfs dm_multipath video output sbs sbshc parport_pc lp parport ses enclosure snd_seq_dummy bnx2 serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer i2c_core snd iTCO_wdt pata_acpi soundcore iTCO_vendor_support ata_generic snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table] > Pid: 8022, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285 > RIP: e030:[<ffffffff8103a3cb>] [<ffffffff8103a3cb>] leave_mm+0x15/0x46 > RSP: e02b:ffff88002803ee48 EFLAGS: 00010046 > RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81675980 > RDX: ffff88002803ee78 RSI: 0000000000000000 RDI: 0000000000000000 > RBP: ffff88002803ee48 R08: ffff8800a4929000 R09: dead000000200200 > R10: dead000000100100 R11: ffffffff81447292 R12: ffff88012ba07b80 > R13: ffff880028046020 R14: 00000000000004fb R15: 0000000000000000 > FS: 00007f410af416e0(0000) GS:ffff88002803b000(0000) knlGS:0000000000000000 > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000000000469000 CR3: 00000000ad639000 CR4: 0000000000002660 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process khelper (pid: 8022, threadinfo ffff8800a4846000, task ffff8800a9ed0000) > Stack: > ffff88002803ee68
ffffffff8100e4a4 0000000000000001 ffff880097de3b88 > <0> ffff88002803ee98 ffffffff81087224 ffff88002803ee78 ffff88002803ee78 > <0> ffff88015f808180 00000000000004fb ffff88002803eea8 ffffffff810100e8 > Call Trace: > <IRQ> > [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53 > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc > [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28 > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e > [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d > [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46 > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 > <EOI> > [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 > [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500 > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d > > [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7 > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255 > [<ffffffff81114366>] ? do_execve+0x1c3/0x29e > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e > [<ffffffff81013daa>] ? child_rip+0xa/0x20 > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 > [<ffffffff81013da0>] ? 
child_rip+0x0/0x20 > Code: 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 e8 17 ff ff ff c9 c3 55 48 89 e5 0f 1f 44 00 00 65 8b 04 25 c8 55 01 00 ff c8 75 04 <0f> 0b eb fe 65 48 8b 34 25 c0 55 01 00 48 81 c6 b8 02 00 00 e8 > RIP [<ffffffff8103a3cb>] leave_mm+0x15/0x46 > RSP <ffff88002803ee48> > ---[ end trace 1522f17fdfc9162d ]--- > Kernel panic - not syncing: Fatal exception in interrupt > Pid: 8022, comm: khelper Tainted: G D 2.6.32.36xen #1 > Call Trace: > <IRQ> [<ffffffff8105682e>] panic+0xe0/0x19a > [<ffffffff8144006a>] ? init_amd+0x296/0x37a

Hmmm... both machines are using AMD CPU? Did you hit the same bug on Intel CPU?

> [<ffffffff8100f169>] ? xen_force_evtchn_callback+0xd/0xf > [<ffffffff8100f8c2>] ? check_events+0x12/0x20 > [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25 > [<ffffffff81448165>] oops_end+0xb6/0xc6 > [<ffffffff810166e5>] die+0x5a/0x63 > [<ffffffff81447a3c>] do_trap+0x115/0x124 > [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5 > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46 > [<ffffffff8100f6e6>] ? xen_clocksource_read+0x21/0x23 > [<ffffffff8100f258>] ? HYPERVISOR_vcpu_op+0xf/0x11 > [<ffffffff8100f753>] ? xen_vcpuop_set_next_event+0x52/0x67 > [<ffffffff81013b3b>] invalid_op+0x1b/0x20 > [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 > [<ffffffff8103a3cb>] ? leave_mm+0x15/0x46 > [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53 > [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc > [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28 > [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 > [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e > [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d > [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46 > [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 > <EOI> [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 > [<ffffffff8100f8af>] ?
xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff81113f75>] ? flush_old_exec+0x3ac/0x500 > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef > [<ffffffff81042fcf>] ? need_resched+0x23/0x2d > [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7 > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255 > [<ffffffff81114366>] ? do_execve+0x1c3/0x29e > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e > [<ffffffff81013daa>] ? child_rip+0xa/0x20 > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 > [<ffffffff81013da0>] ? child_rip+0x0/0x20 > (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
>
>> [snip: Konrad's mail of Tue, 12 Apr 2011 and the first crash log, quoted in full above]
> Date: Thu, 14 Apr 2011 15:26:14 +0800
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; jeremy@goop.org; konrad.wilk@oracle.com
>
> 2011/4/14 MaoXiaoyun <tinnycloud@hotmail.com>:
> > Hi:
> >
> > I've done the test with "cpuidle=0 cpufreq=none"; two machines crashed.
> >
> > [snip: second crash log, quoted in full above]
>
> Hmmm... both machines are using AMD CPU? Did you hit the same bug on Intel CPU?

It is Intel CPU, not AMD. model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Hi:

As I go through the code: from tlb.c:60, it looks like cpu_tlbstate.state is TLBSTATE_OK, which indicates the CPU is in user space, but the caller, at mmu.c:1512, took the (active_mm == mm) branch, which indicates kernel space; that is the conflict. The panicking CPU is processing an IPI interrupt, so could something be wrong with the CPU mask? Thanks.

======arch/x86/mm/tlb.c======

 58 void leave_mm(int cpu)
 59 {
 60 	if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
 61 		BUG();
 62 	cpumask_clear_cpu(cpu,
 63 			  mm_cpumask(percpu_read(cpu_tlbstate.active_mm)));
 64 	load_cr3(swapper_pg_dir);
 65 }
 66 EXPORT_SYMBOL_GPL(leave_mm);

======arch/x86/xen/mmu.c======

1502 #ifdef CONFIG_SMP
1503 /* Another cpu may still have their %cr3 pointing at the pagetable, so
1504    we need to repoint it somewhere else before we can unpin it. */
1505 static void drop_other_mm_ref(void *info)
1506 {
1507 	struct mm_struct *mm = info;
1508 	struct mm_struct *active_mm;
1509
1510 	active_mm = percpu_read(cpu_tlbstate.active_mm);
1511
1512 	if (active_mm == mm)
1513 		leave_mm(smp_processor_id());
1514
1515 	/* If this cpu still has a stale cr3 reference, then make sure
1516 	   it has been flushed. */
1517 	if (percpu_read(xen_current_cr3) == __pa(mm->pgd))
1518 		load_cr3(swapper_pg_dir);
1519 }

> Date: Thu, 14 Apr 2011 15:26:14 +0800
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> From: giamteckchoon@gmail.com
> To: tinnycloud@hotmail.com
> CC: xen-devel@lists.xensource.com; jeremy@goop.org; konrad.wilk@oracle.com
>
> [snip: quoted mail and crash log, same as above]
flush_old_exec+0x3ac/0x500 > > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef > > [<ffffffff81042fcf>] ? need_resched+0x23/0x > > 2d > > [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7 > > [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef > > [<ffffffff81113098>] ? search_binary_handler+0xc8/0x255 > > [<ffffffff81114366>] ? do_execve+0x1c3/0x29e > > [<ffffffff8101155d>] ? sys_execve+0x43/0x5d > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff8100f8af>] ? xen_restore_fl_direct_end+0x0/0x1 > > [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e > > [<ffffffff81013daa>] ? child_rip+0xa/0x20 > > [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f > > [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b > > [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 > > [<ffffffff81013da0>] ? child_rip+0x0/0x20 > > (XEN) Domain 0 crashed: ''noreboot'' set - not rebooting. > > > >> Date: Tue, 12 Apr 2011 06:00:00 -0400 > >> From: konrad.wilk@oracle.com > >> To: tinnycloud@hotmail.com > >> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; > >> jeremy@goop.org > >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61 > >> > >> On Tue, Apr 12, 2011 at 05:11:51PM +0800, MaoXiaoyun wrote: > >> > > >> > Hi : > >> > > >> > We are using pvops kernel 2.6.32.36 + xen 4.0.1, but confront a kernel > >> > panic bug. > >> > > >> > 2.6.32.36 Kernel: > >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commit;h=bb1a15e55ec665a64c8a9c6bd699b1f16ac01ff4 > >> > Xen 4.0.1 http://xenbits.xen.org/hg/xen-4.0-testing.hg/rev/b536ebfba183 > >> > > >> > Our test is simple, 24 HVMS(Win2003 ) on a single host, each HVM loopes > >> > in restart every 15minutes. > >> > >> What is the storage that you are using for your guests? AoE? 
Local disks? > >> > >> > About 17 machines are invovled in the test, after 10 hours run, one > >> > confrontted a crash at arch/x86/mm/tlb.c:61 > >> > > >> > Currently I am trying "cpuidle=0 cpufreq=none" tests based on Teck''s > >> > suggestion. > >> > > >> > Any comments, thanks. > >> >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi:

Could the crash be related to this patch?
http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3

Since the TLB state change to TLBSTATE_OK (mmu_context.h:40) now happens before cpumask_clear_cpu (line 49), could it be that right after executing line 40 of mmu_context.h, the CPU receives an IPI from another CPU to flush the mm, and, in the interrupt handler, finds the TLB state is already TLBSTATE_OK, which conflicts?

Thanks.

arch/x86/include/asm/mmu_context.h

 33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 34 			     struct task_struct *tsk)
 35 {
 36 	unsigned cpu = smp_processor_id();
 37 
 38 	if (likely(prev != next)) {
 39 #ifdef CONFIG_SMP
 40 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 41 		percpu_write(cpu_tlbstate.active_mm, next);
 42 #endif
 43 		cpumask_set_cpu(cpu, mm_cpumask(next));
 44 
 45 		/* Re-load page tables */
 46 		load_cr3(next->pgd);
 47 
 48 		/* stop flush ipis for the previous mm */
 49 		cpumask_clear_cpu(cpu, mm_cpumask(prev));

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
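The suspected interleaving can be sketched outside the kernel as a small user-space model (all names here, such as cpu_view and bug_window, are mine, not kernel code, and it assumes the CPU was in lazy TLB mode before the switch): with the reordered switch_mm there is a step at which TLBSTATE_OK is already set while the CPU is still in prev's cpumask, so a flush IPI delivered in that window would hit the BUG() in leave_mm; with the old ordering (mask cleared first) no such step exists.

```c
#include <assert.h>
#include <stdbool.h>

#define TLBSTATE_OK   1
#define TLBSTATE_LAZY 2

/* One CPU's view while switch_mm() runs: a flush IPI for `prev` can only
 * be sent while this CPU is still set in prev's cpumask, and the BUG()
 * in leave_mm() fires when the IPI handler finds TLBSTATE_OK. */
struct cpu_view {
	int  tlbstate;
	bool in_prev_mask;
};

/* step = number of switch_mm statements already executed (0, 1 or 2) */

/* Reordered code: set TLBSTATE_OK first (line 40), clear mask last (line 49). */
static struct cpu_view new_order(int step)
{
	struct cpu_view v = { TLBSTATE_LAZY, true };
	if (step >= 1)
		v.tlbstate = TLBSTATE_OK;
	if (step >= 2)
		v.in_prev_mask = false;
	return v;
}

/* Old code: clear prev's mask first, set TLBSTATE_OK afterwards. */
static struct cpu_view old_order(int step)
{
	struct cpu_view v = { TLBSTATE_LAZY, true };
	if (step >= 1)
		v.in_prev_mask = false;
	if (step >= 2)
		v.tlbstate = TLBSTATE_OK;
	return v;
}

/* An IPI delivered at this step trips the BUG() iff it can still be sent
 * (cpu in prev's mask) and the state is already TLBSTATE_OK. */
static bool bug_window(struct cpu_view v)
{
	return v.in_prev_mask && v.tlbstate == TLBSTATE_OK;
}

/* Enumerate every interruption point and ask whether any of them is bad. */
static bool has_window(struct cpu_view (*order)(int))
{
	for (int step = 0; step <= 2; step++)
		if (bug_window(order(step)))
			return true;
	return false;
}
```

This only models the ordering question raised above; it says nothing about whether this particular window is the one being hit in practice.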
Jeremy Fitzhardinge
2011-Apr-15 21:22 UTC
[Xen-devel] Re: Kernel BUG at arch/x86/mm/tlb.c:61
On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> Hi:
>
> Could the crash related to this patch ?
> http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>
> Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> cpumask_clear_cpu(line 49).
> Could it possible that right after execute line 40 of mmu_context.h,
> CPU revice IPI from other CPU to
> flush the mm, and when in interrupt, find the TLB state happened to be
> TLBSTATE_OK. Which conflicts.

Does reverting it help?

    J

> Thanks.
>
> arch/x86/include/asm/mmu_context.h
>
>  33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  34 			     struct task_struct *tsk)
>  35 {
>  36 	unsigned cpu = smp_processor_id();
>  37 
>  38 	if (likely(prev != next)) {
>  39 #ifdef CONFIG_SMP
>  40 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  41 		percpu_write(cpu_tlbstate.active_mm, next);
>  42 #endif
>  43 		cpumask_set_cpu(cpu, mm_cpumask(next));
>  44 
>  45 		/* Re-load page tables */
>  46 		load_cr3(next->pgd);
>  47 
>  48 		/* stop flush ipis for the previous mm */
>  49 		cpumask_clear_cpu(cpu, mm_cpumask(prev));

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> Date: Fri, 15 Apr 2011 14:22:29 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>
> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > Could the crash related to this patch ?
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >
> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> > cpumask_clear_cpu(line 49).
> > Could it possible that right after execute line 40 of mmu_context.h,
> > CPU revice IPI from other CPU to
> > flush the mm, and when in interrupt, find the TLB state happened to be
> > TLBSTATE_OK. Which conflicts.
>
> Does reverting it help?
>
> J

Very likely. Previously, in the 17-machine test, one to three machines would fail within 10 hours, very easily. But after reverting, we have 29 machines involved in the test: 28 successfully ran for 2 days, and 1 failed after 28 hours.

Unfortunately I can't tell whether the failed one is related to this bug, since I got no log in messages, and the machine was rebooted by someone before I could see anything from the serial port. But in my opinion that failure points to another bug, which I happened to confront before.

Earlier, one of my development machines (2.6.32.36 kernel + xen 4.0.1) completely stopped responding, including the serial console. There was no abnormal message on the serial port; it looks like Xen ran into a deadlock. It rarely happens, since I have only met it once so far.

Now I am trying to figure out what might cause the deadlock; we never met this before. I don't have clear thoughts on how to dig it out, but I think this bug exists in Xen, since if only dom0 hung, Xen should still work and the serial output would respond. If so, the bug may have been introduced between 4.0.0 and 4.0.1.

What do you think? Thanks.

> >
> > Thanks.
> > > > arch/x86/include/asm/mmu_context.h > > > > 33 static inline void switch_mm(struct mm_struct *prev, struct > > mm_struct *next, > > 34 <+++<+++<+++ struct task_struct *tsk) > > 35 { > > 36 <+++unsigned cpu = smp_processor_id(); > > 37 > > 38 <+++if (likely(prev != next)) { > > 39 #ifdef CONFIG_SMP > > 40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK); > > 41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next); > > 42 #endif > > 43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next)); > > 44 > > 45 <+++<+++/* Re-load page tables */ > > 46 <+++<+++load_cr3(next->pgd); > > 47 > > 48 <+++<+++/* stop flush ipis for the previous mm */ > > 49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev)); > > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> Date: Fri, 15 Apr 2011 14:22:29 -0700
> From: jeremy@goop.org
> To: tinnycloud@hotmail.com
> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>
> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> > Hi:
> >
> > Could the crash related to this patch ?
> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >
> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> > cpumask_clear_cpu(line 49).
> > Could it possible that right after execute line 40 of mmu_context.h,
> > CPU revice IPI from other CPU to
> > flush the mm, and when in interrupt, find the TLB state happened to be
> > TLBSTATE_OK. Which conflicts.
>
> Does reverting it help?
>
> J

Hi Jeremy:

The latest test result shows that reverting didn't help; the kernel panics at exactly the same place in tlb.c.

I have a question about the TLB state. From the stack:

    xen_do_hypervisor_callback -> xen_evtchn_do_upcall -> ... -> drop_other_mm_ref

what should cpu_tlbstate.state be here; are both TLBSTATE_OK and TLBSTATE_LAZY possible? That is, after a hypercall from user space the state will be TLBSTATE_OK, and from kernel space it will be TLBSTATE_LAZY?

thanks.

[<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
[<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
[<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
[<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
[<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
[<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
[<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
[<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30

> >
> > Thanks.
> > > > arch/x86/include/asm/mmu_context.h > > > > 33 static inline void switch_mm(struct mm_struct *prev, struct > > mm_struct *next, > > 34 <+++<+++<+++ struct task_struct *tsk) > > 35 { > > 36 <+++unsigned cpu = smp_processor_id(); > > 37 > > 38 <+++if (likely(prev != next)) { > > 39 #ifdef CONFIG_SMP > > 40 <+++<+++percpu_write(cpu_tlbstate.state, TLBSTATE_OK); > > 41 <+++<+++percpu_write(cpu_tlbstate.active_mm, next); > > 42 #endif > > 43 <+++<+++cpumask_set_cpu(cpu, mm_cpumask(next)); > > 44 > > 45 <+++<+++/* Re-load page tables */ > > 46 <+++<+++load_cr3(next->pgd); > > 47 > > 48 <+++<+++/* stop flush ipis for the previous mm */ > > 49 <+++<+++cpumask_clear_cpu(cpu, mm_cpumask(prev)); > > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I went through switch_mm more and came up with one more question: why don't we need to clear the prev cpumask in the else branch, between lines 59 and 60?

Say:
1) Context switches from process A to a kernel thread; the kernel thread then has active_mm pointing to A's mm.
2) Context switches from the kernel thread back to A; in sched.c, oldmm = A's mm and mm = A's mm.
3) So we take the prev == next branch at arch/x86/include/asm/mmu_context.h:60. If another CPU flushes A's mm, since this CPU did not clear itself from the cpumask, it might enter the IPI interrupt routine and also find cpu_tlbstate.state is TLBSTATE_OK.

Could this be possible?

kernel/sched.c

2999 context_switch(struct rq *rq, struct task_struct *prev,
3000 	       struct task_struct *next)
3001 {
3002 	struct mm_struct *mm, *oldmm;
3003 
3004 	prepare_task_switch(rq, prev, next);
3005 	trace_sched_switch(rq, prev, next);
3006 	mm = next->mm;
3007 	oldmm = prev->active_mm;
3008 	/*
3009 	 * For paravirt, this is coupled with an exit in switch_to to
3010 	 * combine the page table reload and the switch backend into
3011 	 * one hypercall.
3012 	 */
3013 	arch_start_context_switch(prev);
3014 
3015 	if (unlikely(!mm)) {
3016 		next->active_mm = oldmm;
3017 		atomic_inc(&oldmm->mm_count);
3018 		enter_lazy_tlb(oldmm, next);
3019 	} else
3020 		switch_mm(oldmm, mm, next);
3021 
3022 	if (unlikely(!prev->mm)) {
3023 		prev->active_mm = NULL;
3024 		rq->prev_mm = oldmm;
3025 	}

arch/x86/include/asm/mmu_context.h

 33 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 34 			     struct task_struct *tsk)
 35 {
 36 	unsigned cpu = smp_processor_id();
 37 
 38 	if (likely(prev != next)) {
 39 		/* stop flush ipis for the previous mm */
 40 		cpumask_clear_cpu(cpu, mm_cpumask(prev));
 41 
 42 
 43 #ifdef CONFIG_SMP
 44 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 45 		percpu_write(cpu_tlbstate.active_mm, next);
 46 #endif
 47 		cpumask_set_cpu(cpu, mm_cpumask(next));
 48 
 49 		/* Re-load page tables */
 50 		load_cr3(next->pgd);
 51 
 52 		/*
 53 		 * load the LDT, if the LDT is different:
 54 		 */
 55 		if (unlikely(prev->context.ldt != next->context.ldt))
 56 			load_LDT_nolock(&next->context);
 57 	}
 58 #ifdef CONFIG_SMP
 59 	else {
 60 		percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 61 		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
 62 
 63 		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
 64 			/* We were in lazy tlb mode and leave_mm disabled
 65 			 * tlb flush IPI delivery. We must reload CR3
 66 			 * to make sure to use no freed page tables.
 67 			 */
 68 			load_cr3(next->pgd);
 69 			load_LDT_nolock(&next->context);
 70 		}
 71 	}

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Add some debug info in drop_other_mm_ref(line 1516), get on machine crash. log attached, pity I lost prink info. Does current->mm indicates userspace? Thanks. ===========================1502 #ifdef CONFIG_SMP 1503 /* Another cpu may still have their %cr3 pointing at the pagetable, so 1504 we need to repoint it somewhere else before we can unpin it. */ 1505 static void drop_other_mm_ref(void *info) 1506 { 1507 <+++struct mm_struct *mm = info; 1508 <+++struct mm_struct *active_mm; 1509 1510 <+++active_mm = percpu_read(cpu_tlbstate.active_mm); 1511 1512 <+++if (active_mm == mm){ 1513 if(current->mm){ 1514 <+++<+++ printk("in userspace active_mm %p mm %p curr_mm %p tlbstate%d\n", 1515 active_mm, mm, current->mm, percpu_read(cpu_tlbstate.state)); 1516 BUG(); 1517 } 1518 <+++<+++leave_mm(smp_processor_id()); 1519 } 1520 =========================== Starting udev: ------------[ cut here ]------------ kernel BUG at arch/x86/xen/mmu.c:1516! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/class/raw/rawctl/dev CPU 2 Modules linked in: snd_seq_dummy bnx2 snd_seq_oss(+) snd_seq_midi_event snd_seq snd_seq_device serio_raw snd_pcm_oss snd_mixer_oss snd_pcm snd_timer i2c_i801 i2c_core iTCO_wdt snd pata_acpi iTCO_vendor_support ata_generic soundcore snd_page_alloc pcspkr ata_piix shpchp mptsas mptscsih mptbase Pid: 1126, comm: khelper Not tainted 2.6.32.36xen #1 Tecal RH2285 RIP: e030:[<ffffffff8100e4c0>] [<ffffffff8100e4c0>] drop_other_mm_ref+0x46/0x80 RSP: e02b:ffff880028078e58 EFLAGS: 00010092 RAX: 0000000000000015 RBX: 0000000000000001 RCX: 00000000ffff0075 RDX: 0000000000009f9f RSI: ffffffff8144006a RDI: 0000000000000004 RBP: ffff880028078e68 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000028078cf8 R11: 0000000000000246 R12: ffff88012c032680 R13: ffff880028080020 R14: 00000000000004f1 R15: 0000000000000000 FS: 00007f01adcf8710(0000) GS:ffff880028075000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f01adf20648 CR3: 
000000012a546000 CR4: 0000000000002660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process khelper (pid: 1126, threadinfo ffff88012d80e000, task ffff88012b880000) Stack: 0000000000000001 ffff88012bb9bb88 ffff880028078e98 ffffffff81087224 <0> ffff880028078e78 ffff880028078e78 ffff88015f808540 00000000000004f1 <0> ffff880028078ea8 ffffffff81010118 ffff880028078ee8 ffffffff810a936a Call Trace: <IRQ> [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc [<ffffffff81010118>] xen_call_function_single_interrupt+0x13/0x28 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 <EOI> [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf [<ffffffff8100f8f2>] ? check_events+0x12/0x20 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100d47f>] ? xen_mc_issue+0x2e/0x33 [<ffffffff8100e42f>] ? __xen_pgd_pin+0xc1/0xc9 [<ffffffff8100e449>] ? xen_pgd_pin+0x12/0x14 [<ffffffff8100e470>] ? xen_activate_mm+0x25/0x2f [<ffffffff81113f59>] ? flush_old_exec+0x390/0x500 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef [<ffffffff81042fcf>] ? need_resched+0x23/0x2d [<ffffffff811f463c>] ? process_measurement+0xc0/0xd7 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef [<ffffffff81113098>] ? 
search_binary_handler+0xc8/0x255 [<ffffffff81114366>] ? do_execve+0x1c3/0x29e [<ffffffff8101155d>] ? sys_execve+0x43/0x5d [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f [<ffffffff81013e28>] ? kernel_execve+0x68/0xd0 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8106fb64>] ? ____call_usermodehelper+0x113/0x11e [<ffffffff81013daa>] ? child_rip+0xa/0x20 [<ffffffff8106fc45>] ? __call_usermodehelper+0x0/0x6f [<ffffffff81012f91>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8101371d>] ? retint_restore_args+0x5/0x6 [<ffffffff81013da0>] ? child_rip+0x0/0x20 Code: 75 3a 65 48 8b 04 25 c0 cb 00 00 48 83 b8 78 02 00 00 00 74 1a 65 8b 34 25 c8 55 01 00 48 c7 c7 06 98 5b 81 31 c0 e8 d9 90 04 00 <0f> 0b eb fe 65 8b 3c 25 78 e3 00 00 e8 e5 be 02 00 65 48 8b 1c RIP [<ffffffff8100e4c0>] drop_other_mm_ref+0x46/0x80 RSP <ffff880028078e58> [<ffffffff8144006a>] ? init_amd+0x296/0x37a [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf [<ffffffff8100f8f2>] ? check_events+0x12/0x20 [<ffffffff81056487>] ? print_oops_end_marker+0x23/0x25 [<ffffffff81448165>] oops_end+0xb6/0xc6 [<ffffffff810166e5>] die+0x5a/0x63 [<ffffffff81447a3c>] do_trap+0x115/0x124 [<ffffffff810148e6>] do_invalid_op+0x9c/0xa5 [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80 [<ffffffff81057640>] ? printk+0xa7/0xa9 [<ffffffff81013b3b>] invalid_op+0x1b/0x20 [<ffffffff8144006a>] ? init_amd+0x296/0x37a [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80 [<ffffffff8100e4c0>] ? drop_other_mm_ref+0x46/0x80 [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc [<ffffffff81010118>] xen_call_function_single_interrupt+0x13/0x28 [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120 [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46 [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30 <EOI> [<ffffffff8100f8df>] ? 
xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000 [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 [<ffffffff8100f195>] ? xen_force_evtchn_callback+0xd/0xf [<ffffffff8100f8f2>] ? check_events+0x12/0x20 [<ffffffff81447292>] ? _spin_unlock_irqrestore+0x15/0x17 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100f8df>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff8100d47f>] ? xen_mc_issue+0x2e/0x33 [<ffffffff8100e42f>] ? __xen_pgd_pin+0xc1/0xc9 [<ffffffff8100e449>] ? xen_pgd_pin+0x12/0x14 [<ffffffff8100e470>] ? xen_activate_mm+0x25/0x2f [<ffffffff81113f59>] ? flush_old_exec+0x390/0x500 [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef [<ffffffff81150dc9>] ? load_elf_binary+0x0/0x17ef [<ffffffff81151161>] ? load_elf_binary+0x398/0x17ef _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: tinnycloud@hotmail.com >To: jeremy@goop.org >CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com >Subject: RE: Kernel BUG at arch/x86/mm/tlb.c:61 >Date: Mon, 25 Apr 2011 20:54:54 +0800>Add some debug info in drop_other_mm_ref(line 1516), get on machine crash. >log attached, pity I lost prink info.printk info: in userspace active_mm ffff8800a3669f80 mm ffff8800a3669f80 curr_mm ffff88008d73c000 tlbstate 2>Does current->mm indicates userspace? >Thanks._______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Please ignore my last two mails; I just learnt that "current" is meaningless in IRQ context.

Let me come up with one whole assumption. In my opinion:

1) A CPU running in switch_mm has the possibility of receiving an IPI message and entering the interrupt handler.
2) Before reverting that patch, no matter whether the if statement is true or not, cpu_tlbstate.state could be changed to TLBSTATE_OK right before entering the IRQ routine.
3) Since cpu_tlbstate is a per-CPU variable, testing cpu_tlbstate.state in drop_other_mm_ref before calling leave_mm() is feasible and necessary.
4) If I am right, the strange thing is that this code in 2.6.32.36 is the same as in 2.6.31.x, where we never met this TLB bug before.

Any comments? Many thanks.

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: MaoXiaoyun
>Sent: Monday, April 25, 2011 11:15 AM
>> Date: Fri, 15 Apr 2011 14:22:29 -0700
>> From: jeremy@goop.org
>> To: tinnycloud@hotmail.com
>> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
>> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
>>
>> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
>> > Hi:
>> >
>> > Could the crash related to this patch ?
>> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
>> >
>> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
>> > cpumask_clear_cpu(line 49).
>> > Could it possible that right after execute line 40 of mmu_context.h,
>> > CPU revice IPI from other CPU to
>> > flush the mm, and when in interrupt, find the TLB state happened to be
>> > TLBSTATE_OK. Which conflicts.
>>
>> Does reverting it help?
>>
>> J
>
>Hi Jeremy:
>
> The lastest test result shows the reverting didn't help.
> Kernel panic exactly at the same place in tlb.c.
>
> I have question about TLB state, from the stack,
> xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref
>
> What cpu_tlbstate.state should be, could TLBSTATE_OK or TLBSTATE_LAZY all be possible?
> That is after a hypercall from userspace, state will be TLBSTATE_OK, and
> if from kernel space, state will be TLBSTATE_LAZE ?
>
> thanks.

It looks like a bug in the drop_other_mm_ref implementation: the current TLB state should be checked before invoking leave_mm(). There is a window between the lines of code below:

<xen_drop_mm_ref>

	/* Get the "official" set of cpus referring to our pagetable. */
	if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
		for_each_online_cpu(cpu) {
			if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
			    && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
				continue;
			smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
		}
		return;
	}

There is a chance that by the time smp_call_function_single is invoked, the actual TLB state has already been updated on the other cpu. The upstream kernel patch you referred to earlier just makes this bug exposed more easily; even without that patch you may still suffer this issue, which is why reverting the patch doesn't help.

Could you try adding a check in drop_other_mm_ref?

	if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
		leave_mm(smp_processor_id());

Once the interrupted context has TLBSTATE_OK, it implies that it will handle the TLB flush itself later, so there is no need for leave_mm from the interrupt handler; that is the assumption behind doing leave_mm.

Thanks
Kevin

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
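The guard proposed here is small enough to pin down as a truth table. Below is a user-space model (the function names are mine; the kernel's percpu_read of cpu_tlbstate.state is replaced by a plain parameter) contrasting the unguarded condition, which calls leave_mm() and hence hits its BUG() under TLBSTATE_OK, with the guarded one:

```c
#include <assert.h>
#include <stdbool.h>

#define TLBSTATE_OK   1
#define TLBSTATE_LAZY 2

/* Original condition in drop_other_mm_ref(): call leave_mm() whenever the
 * target mm is this CPU's active_mm -- leave_mm() BUG()s if the state is
 * TLBSTATE_OK. */
static bool leave_mm_unguarded(bool active_mm_is_target, int tlbstate)
{
	(void)tlbstate;   /* state is never consulted */
	return active_mm_is_target;
}

/* Proposed condition: additionally skip leave_mm() when this CPU is in
 * TLBSTATE_OK, on the assumption that an actively-running CPU will handle
 * the TLB flush itself later. */
static bool leave_mm_guarded(bool active_mm_is_target, int tlbstate)
{
	return active_mm_is_target && tlbstate != TLBSTATE_OK;
}
```

The only row that changes is (active_mm == mm, TLBSTATE_OK), which is exactly the combination that currently trips the BUG() in the crash reports above.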
The race window is always there, but whether it gets triggered is not deterministic. It's possible that you have never met this bug on 2.6.31.x so far, but that doesn't mean you won't meet it in the long run. :)

Thanks
Kevin

From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of MaoXiaoyun
Sent: Monday, April 25, 2011 11:05 PM
To: jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61

Please ignore my last two mails, I just learnt that Current is meanless in irq context. Just come up one whole assumption: In my opinion: 1) CPU running in switch_mm has the possiblity of receiving IPI message and enter interrupt 2) Before revert that patch, not matter the if statement is true or not, the cpu_tlbstate.state could be changed to TLBSTATE_OK, right before enter irq routhine 3) Since the cpu_tlbstate is per CPU variable, before calling leave_mm(), test cpu_tlbstate.state in drop_other_mm_ref is feasible and nessary 4) If I am right, strange thing is the code of 2.6.32.36 is same as 2.6.31.x, which we never met tlb bug before. any comments? Many thanks.

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Many thanks, Kevin. I agree on the race window.

One thing more: in my understanding, the CPU that sends out the IPI message will unpin the pagetable after receiving ACKs from all the other cpus. If a CPU that receives the IPI enters drop_other_mm_ref with TLBSTATE_OK and does nothing, could it possibly end up using a stale pagetable (one that has since been unpinned by the sender CPU)? So do we need to flush the TLB when the state is TLBSTATE_OK?

	if (active_mm == mm) {
		if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
			load_cr3(mm->pgd);
		else
			leave_mm(smp_processor_id());
	}

> From: kevin.tian@intel.com
> To: tinnycloud@hotmail.com; jeremy@goop.org
> CC: xen-devel@lists.xensource.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com
> Date: Tue, 26 Apr 2011 13:52:11 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
>
> >From: MaoXiaoyun
> >Sent: Monday, April 25, 2011 11:15 AM
> >> Date: Fri, 15 Apr 2011 14:22:29 -0700
> >> From: jeremy@goop.org
> >> To: tinnycloud@hotmail.com
> >> CC: giamteckchoon@gmail.com; xen-devel@lists.xensource.com; konrad.wilk@oracle.com
> >> Subject: Re: Kernel BUG at arch/x86/mm/tlb.c:61
> >>
> >> On 04/15/2011 05:23 AM, MaoXiaoyun wrote:
> >> > Hi:
> >> >
> >> > Could the crash related to this patch ?
> >> > http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=45bfd7bfc6cf32f8e60bb91b32349f0b5090eea3
> >> >
> >> > Since now TLB state change to TLBSTATE_OK(mmu_context.h:40) is before
> >> > cpumask_clear_cpu(line 49).
> >> > Could it possible that right after execute line 40 of mmu_context.h,
> >> > CPU revice IPI from other CPU to
> >> > flush the mm, and when in interrupt, find the TLB state happened to be
> >> > TLBSTATE_OK. Which conflicts.
> >>
> >> Does reverting it help?
> >>
> >> J
> >
> >Hi Jeremy:
> >
> > The lastest test result shows the reverting didn't help.
> > Kernel panic exactly at the same place in tlb.c.
> > > > I have question about TLB state, from the stack, > > xen_do_hypervisor_callback-> xen_evtchn_do_upcall->... ->drop_other_mm_ref > > > > What cpu_tlbstate.state should be, could TLBSTATE_OK or TLBSTATE_LAZY all be possible? > > That is after a hypercall from userspace, state will be TLBSTATE_OK, and > > if from kernel space, state will be TLBSTATE_LAZE ? > > > > thanks. > > it looks a bug in drop_other_mm_ref implementation, that current TLB state should be checked > before invoking leave_mm(). There''s a window between below lines of code: > > <xen_drop_mm_ref> > /* Get the "official" set of cpus referring to our pagetable. */ > if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) { > for_each_online_cpu(cpu) { > if (!cpumask_test_cpu(cpu, mm_cpumask(mm)) > && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd)) > continue; > smp_call_function_single(cpu, drop_other_mm_ref, mm, 1); > } > return; > } > > there''s chance that when smp_call_function_single is invoked, actual TLB state has been > updated in the other cpu. The upstream kernel patch you referred to earlier just makes > this bug exposed more easily. But even without this patch, you may still suffer such issue > which is why reverting the patch doesn''t help. > > Could you try adding a check in drop_other_mm_ref? > > if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK) > leave_mm(smp_processor_id()); > > once the interrupted context has TLBSTATE_OK, it implicates that later it will handle > the TLB flush and thus no need for leave_mm from interrupt handler, and that''s the > assumption of doing leave_mm. > > Thanks > Kevin_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I think that should be fine. Note a later check:

/* If this cpu still has a stale cr3 reference, then make sure
   it has been flushed. */
if (percpu_read(xen_current_cr3) == __pa(mm->pgd))
        load_cr3(swapper_pg_dir);

This should ensure the stale TLB is flushed if this cpu is still in lazy mode.

Thanks
Kevin

From: MaoXiaoyun [mailto:tinnycloud@hotmail.com]
Sent: Tuesday, April 26, 2011 3:05 PM
To: Tian, Kevin; jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61

Many thanks, Kevin. I agree on the race window. One thing more: in my understanding, the CPU that sends out the IPI will unpin the page table after it receives the ACKs from all other CPUs. If a CPU that receives the IPI enters drop_other_mm_ref with TLBSTATE_OK and does nothing, is it possible that it later confronts a stale page table (one already unpinned by the sending CPU)? So do we need to flush the TLB when the state is TLBSTATE_OK?

if (active_mm == mm) {
        if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                load_cr3(mm->pgd);
        else
                leave_mm(smp_processor_id());
}
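MaoXiaoyun's stale-TLB worry and Kevin's answer fit together in one sketch: the guard skips leave_mm() under TLBSTATE_OK, while the later xen_current_cr3 comparison still forces a flush whenever this CPU's cr3 still points at the dying mm's page table. Again, the types, the integer "pgd addresses", and `SWAPPER_PGD` below are simplified stand-ins for illustration, not the real Linux/Xen definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Combined sketch of the guarded handler plus the later stale-cr3
 * check.  Everything here is a simplified stand-in for the kernel code. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

#define SWAPPER_PGD 0  /* stands in for __pa(swapper_pg_dir) */

struct cpu_model {
    int  tlb_state;    /* percpu cpu_tlbstate.state */
    int  active_mm;    /* id of the mm last active on this cpu */
    int  current_cr3;  /* percpu xen_current_cr3 */
    bool flushed;      /* a cr3 reload (and thus a TLB flush) happened */
};

static void model_drop_other_mm_ref(struct cpu_model *c, int mm, int mm_pgd)
{
    /* Proposed guard: only leave_mm() when not TLBSTATE_OK. */
    if (c->active_mm == mm && c->tlb_state != TLBSTATE_OK) {
        /* leave_mm(): drop to lazy mode and move off this pagetable. */
        c->tlb_state = TLBSTATE_LAZY;
        c->current_cr3 = SWAPPER_PGD;
        c->flushed = true;
    }

    /* The later check Kevin points at: if this cpu still has a stale
     * cr3 reference, make sure it has been flushed.  This covers the
     * path where leave_mm() was skipped. */
    if (c->current_cr3 == mm_pgd) {
        c->current_cr3 = SWAPPER_PGD;  /* load_cr3(swapper_pg_dir) */
        c->flushed = true;
    }
}
```

In the TLBSTATE_OK case the interrupted context is in the middle of a switch and will reload cr3 itself, which is why the thread concluded no extra TLB flush is needed from the handler.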
Jeremy Fitzhardinge
2011-Apr-28 23:29 UTC
Re: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
On 04/25/2011 10:52 PM, Tian, Kevin wrote:
> it looks like a bug in the drop_other_mm_ref implementation, in that the current TLB state
> should be checked before invoking leave_mm().
>
> Could you try adding a check in drop_other_mm_ref?
>
> if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
>         leave_mm(smp_processor_id());
>
> once the interrupted context has TLBSTATE_OK, that implies it will handle the TLB flush
> itself later, so there is no need for leave_mm from the interrupt handler -- that's the
> assumption behind doing leave_mm.

That seems reasonable. MaoXiaoyun, does it fix the bug for you?

Kevin, could you submit this as a proper patch?

Thanks,
J
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Friday, April 29, 2011 7:29 AM
>
> That seems reasonable. MaoXiaoyun, does it fix the bug for you?
>
> Kevin, could you submit this as a proper patch?

I'm waiting for Xiaoyun's test result before submitting a proper patch, since this
part of the logic is tricky and his test can make sure we don't overlook any corner
cases. :-)

Thanks
Kevin
> From: kevin.tian@intel.com
> To: jeremy@goop.org
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com; giamteckchoon@gmail.com; konrad.wilk@oracle.com
> Date: Fri, 29 Apr 2011 08:19:44 +0800
> Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61
>
> I'm waiting for Xiaoyun's test result before submitting a proper patch, since this
> part of the logic is tricky and his test can make sure we don't overlook any corner
> cases. :-)

I think it works. The test has been running for over 70 hours successfully.
My plan is to run it for one week.

Thanks.
OK, thanks for the update. I'll send out the patch then.

Thanks
Kevin

From: MaoXiaoyun [mailto:tinnycloud@hotmail.com]
Sent: Friday, April 29, 2011 9:51 AM
To: Tian, Kevin; jeremy@goop.org
Cc: xen devel; giamteckchoon@gmail.com; konrad.wilk@oracle.com
Subject: RE: [Xen-devel] RE: Kernel BUG at arch/x86/mm/tlb.c:61

> I think it works. The test has been running for over 70 hours successfully.
> My plan is to run it for one week.
>
> Thanks.