Jui-Hao Chiang
2011-Jan-21 16:19 UTC
[Xen-devel] mem_sharing: summarized problems when domain is dying
Hi, Tim:

From tinnycloud's result, here I summarize the current problems and findings of mem_sharing when a domain is dying.

(1) When a domain is dying, alloc_domheap_page() and set_shared_p2m_entry() simply fail. So shr_lock is not enough to ensure that the domain won't die in the middle of the mem_sharing code. As tinnycloud's code shows, would it be better to use rcu_lock_domain_by_id before calling the above two functions?

(2) What is the proper behavior of nominate/share/unshare when a domain is dying? The following is just my current guess; please comment.

(2.1) nominate: return failure; but we need to check blktap2's code to make sure it understands and acts properly (should be a minor issue now).

(2.2) share: return success but skip the gfns of the dying domain, i.e., don't remove them from the hash list and don't update their p2m entries (set_shared_p2m_entry). We believe p2m_teardown will clean them up later.

(2.3) unshare: this is the most problematic part. Because we cannot call alloc_domheap_page() at this point, the only thing we can do is skip the page and return. But what are the side effects?
(a) If p2m_teardown comes in, there is no problem. Just destroy the page and we are done.
(b) hvm_hap_nested_page_fault: if we return failure, will this cause a problem for the guest? Or we can simply return success to cheat the guest; the guest will then trigger another page fault if it writes the page again.
(c) gnttab_map_grant_ref: this function passes must_succeed to gfn_to_mfn_unshare(), which would BUG if unshare() fails.

Do we really need (b) and (c) in the last steps of domain dying? If so, we need a special alloc_domheap_page for dying domains.

On Thu, Jan 20, 2011 at 4:19 AM, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 07:19 +0000 on 20 Jan (1295507976), MaoXiaoyun wrote:
>> Hi:
>>
>> The latest BUG is in mem_sharing_alloc_page, called from mem_sharing_unshare_page.
>> I printed the heap info, which shows plenty of memory left.
>> Could domain be NULL in unshare, or should it be locked by rcu_lock_domain_by_id?
>
> 'd' probably isn't NULL; more likely the domain is not allowed
> to have any more memory. You should look at the values of d->max_pages
> and d->tot_pages when the failure happens.
>
> Cheers.
>
> Tim.

Bests,
Jui-Hao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
George Dunlap
2011-Jan-21 16:29 UTC
Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
On Fri, Jan 21, 2011 at 4:19 PM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> (b) hvm_hap_nested_page_fault: if we return failure, will this cause a problem
> for the guest? Or we can simply return success to cheat the guest; the guest
> will then trigger another page fault if it writes the page again.
> (c) gnttab_map_grant_ref: this function passes must_succeed to
> gfn_to_mfn_unshare(), which would BUG if unshare() fails.

I took a glance around the code this morning, and it seems like:

(b) should never happen. If a domain is dying, all of its vcpus should be offline. If I'm wrong and there's a race between d->is_dying being set and the vcpus being paused, then the vcpus should just be paused if they get an un-handleable page fault.

(c) happens because backend drivers may still be servicing requests (finishing disk I/O, incoming network packets) before being torn down. It should be OK for those to fail if the domain is dying.

I'm not sure of the exact rationale behind the "cannot fail" flag; but it looks like in grant_table.c, both callers of gfn_to_mfn_unshare() handle the case where the returned p2m entry is just
George Dunlap
2011-Jan-21 16:32 UTC
Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
[sorry, accidentally sent too early]

On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
> I'm not sure of the exact rationale behind the "cannot fail" flag; but it
> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
> handle the case where the returned p2m entry is just

...invalid. I wonder if "unsharing" the page, but marking the entry invalid during death, would help.

I suppose the problem there is that if you're keeping the VM around but paused for analysis, you'll have holes in your address space. But just returning an invalid entry to the callers who try to unshare pages might work.

-George
George Dunlap
2011-Jan-21 16:41 UTC
Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
Tim / Xiaoyun, do you think something like this might work?

-George

On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [sorry, accidentally sent too early]
>
> On Fri, Jan 21, 2011 at 4:29 PM, George Dunlap <dunlapg@umich.edu> wrote:
>> I'm not sure of the exact rationale behind the "cannot fail" flag; but it
>> looks like in grant_table.c, both callers of gfn_to_mfn_unshare()
>> handle the case where the returned p2m entry is just
>
> ...invalid. I wonder if "unsharing" the page, but marking the entry
> invalid during death, would help.
>
> I suppose the problem there is that if you're keeping the VM around
> but paused for analysis, you'll have holes in your address space. But
> just returning an invalid entry to the callers who try to unshare
> pages might work.
>
> -George
Tim Deegan
2011-Jan-21 16:53 UTC
Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
At 16:41 +0000 on 21 Jan (1295628107), George Dunlap wrote:
> Tim / Xiaoyun, do you think something like this might work?

Worth a try. I don't think it will do much harm -- there should be no cases where dom0 really must map a dying domain's memory.

Tim.

> On Fri, Jan 21, 2011 at 4:32 PM, George Dunlap <dunlapg@umich.edu> wrote:
> [snip]
>
> diff -r 9ca9331c9780 xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h Fri Jan 21 15:37:36 2011 +0000
> +++ b/xen/include/asm-x86/p2m.h Fri Jan 21 16:41:58 2011 +0000
> @@ -390,7 +390,14 @@
>                            must_succeed
>                            ? MEM_SHARING_MUST_SUCCEED : 0) )
>      {
> -        BUG_ON(must_succeed);
> +        if ( must_succeed
> +             && p2m->domain->is_dying )
> +        {
> +            mfn = INVALID_MFN;
> +            *p2mt = p2m_invalid;
> +        }
> +        else
> +            BUG_ON(must_succeed);
>          return mfn;
>      }
>      mfn = gfn_to_mfn(p2m, gfn, p2mt);

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
Jui-Hao Chiang
2011-Jan-21 19:45 UTC
[Xen-devel] Re: mem_sharing: summarized problems when domain is dying
Hi,

On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@gmail.com> wrote:
> Hi, Tim:
>
> From tinnycloud's result, here I summarize the current problems and
> findings of mem_sharing when a domain is dying.
> (1) When a domain is dying, alloc_domheap_page() and
> set_shared_p2m_entry() simply fail. So shr_lock is not enough
> to ensure that the domain won't die in the middle of the mem_sharing code.
> As tinnycloud's code shows, would it be better to use
> rcu_lock_domain_by_id before calling the above two functions?

There seems to be no good locking to protect a domain from changing its is_dying state, so the unshare function could fail mid-way at several points, e.g., alloc_domheap_page and set_shared_p2m_entry. If that's the case, we need to add some checks, and probably revert what we have already done when is_dying changes in the middle.

Any comments?

Jui-Hao
MaoXiaoyun
2011-Jan-22 11:17 UTC
RE: [Xen-devel] mem_sharing: summarized problems when domain is dying
Hi George:

Thanks for your kind help.

I think the page type should also be changed inside mem_sharing_unshare_page(), under shr_lock, to prevent someone from unsharing the page again. So your patch and mine together make the whole solution.

As for my patch, it seems that put_page_and_type(page) is enough to clean up the page, and we don't need BUG_ON(set_shared_p2m_entry_invalid(d, gfn) == 0) (which actually calls set_p2m_entry(d, gfn, _mfn(INVALID_MFN), 0, p2m_invalid)), right?

One other thing is rcu_lock_domain_by_id(d->domain_id). When someone holds this lock and d->is_dying == 0, does that mean d->is_dying will not change until rcu_unlock_domain is called? That is to say, does the lock actually protect the whole d structure?

> Date: Fri, 21 Jan 2011 16:41:47 +0000
> Subject: Re: [Xen-devel] mem_sharing: summarized problems when domain is dying
> From: George.Dunlap@eu.citrix.com
> To: juihaochiang@gmail.com
> CC: Tim.Deegan@citrix.com; tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>
> Tim / Xiaoyun, do you think something like this might work?
>
> -George
>
> [snip]
MaoXiaoyun
2011-Jan-24 13:14 UTC
[Xen-devel] RE: mem_sharing: summarized problems when domain is dying
Hi:

Another BUG found when testing memory sharing. In this test, I start 24 Linux HVMs; each of them reboots through "xm reboot" every 30 minutes. After several hours, some of the HVMs crash. All of the crashed HVMs stopped during booting. The bug still exists even if I forbid page sharing by cheating tapdisk so that xc_memshr_nominate_gref() returns failure. And no special log was found.

I was able to dump the crash stack. What could have happened? Thanks.

PID: 2307 TASK: ffff810014166100 CPU: 0 COMMAND: "setfont"
 #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
 #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
 #2 [ffff8100123cd940] panic at ffffffff8009094a
 #3 [ffff8100123cda30] oops_end at ffffffff80064fca
 #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
 #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+363]
    RIP: ffffffff800515e5  RSP: ffff8100123cdbe8  RFLAGS: 00010203
    RAX: 0000000000000000  RBX: ffffffff804b3740  RCX: ffff8100000a03fc
    RDX: 00000000000003fd  RSI: ffff810011cec000  RDI: ffffffff803244c4
    RBP: ffff810011cec000  R8:  d0d6999996000000  R9:  0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
 #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
 #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
 #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
#10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
#11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
#12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
#13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff54c4aec8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff54c4aee0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001d747ab0  R8:  0000000000000010  R9:  0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

> Date: Fri, 21 Jan 2011 14:45:14 -0500
> Subject: Re: mem_sharing: summarized problems when domain is dying
> From: juihaochiang@gmail.com
> To: Tim.Deegan@citrix.com
> CC: tinnycloud@hotmail.com; xen-devel@lists.xensource.com
>
> [snip]
Tim Deegan
2011-Jan-24 14:02 UTC
[Xen-devel] Re: mem_sharing: summarized problems when domain is dying
At 19:45 +0000 on 21 Jan (1295639114), Jui-Hao Chiang wrote:
> [snip]
>
> There seems to be no good locking to protect a domain from changing its
> is_dying state, so the unshare function could fail mid-way at several
> points, e.g., alloc_domheap_page and set_shared_p2m_entry. If that's the
> case, we need to add some checks, and probably revert what we have
> already done when is_dying changes in the middle.

That sounds correct. It would be a good idea to handle failures from those functions anyway!

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
George Dunlap
2011-Jan-24 14:08 UTC
Re: [Xen-devel] RE: mem_sharing: summarized problems when domain is dying
I think it would be best if every separate issue you're facing got its own thread.

This looks like a Linux crash -- please include the kernel version you're using, and whatever other information might be appropriate.

-George

2011/1/24 MaoXiaoyun <tinnycloud@hotmail.com>:
> Hi:
>
> Another BUG found when testing memory sharing. In this test, I start 24
> Linux HVMs; each of them reboots through "xm reboot" every 30 minutes.
> After several hours, some of the HVMs crash.
> [snip]
MaoXiaoyun
2011-Jan-25 04:13 UTC
[Xen-devel] Linux Guest Crash on stress test of memory sharing
Hi:

Following George's suggestion, I am submitting the bug in this new thread.

I start 24 Linux HVMs on a physical host; each of them reboots through "xm reboot" every 30 minutes. After several hours, some of the HVMs crash. All of the crashed HVMs stopped during booting. The bug still exists even if I forbid page sharing by cheating tapdisk so that xc_memshr_nominate_gref() returns failure. There is no bug if memory sharing is disabled. (This means only mem_sharing_nominate_page() is called; in mem_sharing_nominate_page() the page type is set to p2m_shared, so later the page needs to be unshared when someone tries to use it.)

I remember there is a call path in memory sharing, hvm_hap_nested_page_fault() -> mem_sharing_unshare_page(); compared with the crash dump, this might indicate some connection.

DomU kernel is from ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-164.el5.src.rpm
Xen version: 4.0.0

crash dump stack:

crash> bt -l
PID: 2422 TASK: ffff810013b40860 CPU: 1 COMMAND: "setfont"
 #0 [ffff810012cef900] xen_panic_event at ffffffff88001d28
 #1 [ffff810012cef920] notifier_call_chain at ffffffff80066eaa
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
 #2 [ffff810012cef940] panic at ffffffff8009094a
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
 #3 [ffff810012cefa30] oops_end at ffffffff80064fca
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/traps.c: 539
 #4 [ffff810012cefa40] do_page_fault at ffffffff80066dc0
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 591
 #5 [ffff810012cefb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+435]
    RIP: ffffffff8005162d  RSP: ffff810012cefbe8  RFLAGS: 00010287
    RAX: ffff8100000a6000  RBX: ffffffff804b3740  RCX: ffff8100000a4ae0
    RDX: ffff810012d16ae1  RSI: ffff810012d14000  RDI: ffffffff803244c4
    RBP: ffff810012d14000  R8:  d0d6999996000000  R9:  0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff810012cefc20] vgacon_font_set at ffffffff8016bec5
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/video/console/vgacon.c: 1238
 #7 [ffff810012cefc60] con_font_op at ffffffff801aa86b
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt.c: 3645
 #8 [ffff810012cefcd0] vt_ioctl at ffffffff801a5af4
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/vt_ioctl.c: 965
 #9 [ffff810012cefd70] tty_ioctl at ffffffff80038a2c
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/drivers/char/tty_io.c: 3340
#10 [ffff810012cefeb0] do_ioctl at ffffffff800420d9
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 39
#11 [ffff810012cefed0] vfs_ioctl at ffffffff800302ce
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 500
#12 [ffff810012ceff40] sys_ioctl at ffffffff8004c766
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/ioctl.c: 520
#13 [ffff810012ceff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff1a57ed98  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff1a57edb0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001e33dab0  R8:  0000000000000010  R9:  0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
MaoXiaoyun
2011-Jan-25 06:23 UTC
[Xen-devel] RE: Linux Guest Crash on stress test of memory sharing
Hi:

Most of the core dumps have the same stack as submitted before; now we have another stack. Thanks.

crash> bt -l
PID: 1 TASK: ffff8100011df7a0 CPU: 0 COMMAND: "init"
 #0 [ffff8100011fddf0] xen_panic_event at ffffffff88001d28
 #1 [ffff8100011fde10] notifier_call_chain at ffffffff80066eaa
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/sys.c: 146
 #2 [ffff8100011fde30] panic at ffffffff8009094a
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/panic.c: 101
 #3 [ffff8100011fdf20] do_exit at ffffffff80015477
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/kernel/exit.c: 835
 #4 [ffff8100011fdf80] system_call at ffffffff8005d116
    /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/kernel/entry.S
    RIP: 000000000055a5ff  RSP: 00007fff2b8c2e10  RFLAGS: 00010246
    RAX: 00000000000000e7  RBX: ffffffff8005d116  RCX: 0000000000000047
    RDX: 0000000000000001  RSI: 000000000000003c  RDI: 0000000000000001
    RBP: 0000000000000000  R8:  00000000000000e7  R9:  ffffffffffffffb4
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000604ea8  R14: ffffffff80049281  R15: 0000000000000000
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b
crash>

> From: tinnycloud@hotmail.com
> To: tinnycloud@hotmail.com
> Subject: Linux Guest Crash on stress test of memory sharing
> Date: Tue, 25 Jan 2011 13:07:15 +0800
>
> [snip]