James Harper
2009-Jan-01 13:03 UTC
[Xen-devel] domains not shutting down properly - the problem is back again
This was discussed at length (by me :) a while back and I thought I'd resolved it by removing some Debian packages from previous versions of Xen that were still hanging around, but suddenly the problem is back again...

Domains don't die, they just stay in the 's' state until you 'xm destroy' them, and even after that there is still a page or two being used according to 'xm debug q'.

I have upgraded to 3.3.1-rc4 but it doesn't seem to make a difference...

James

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
James Harper
2009-Jan-01 13:09 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> This was discussed at length (by me :) a while back and I thought I'd
> resolved it by removing some Debian packages from previous versions of
> Xen that were still hanging around, but suddenly the problem is back
> again...
>
> Domains don't die, they just stay in the 's' state until you 'xm
> destroy' them, and even after that there is still a page or two being
> used according to 'xm debug q'.
>
> I have upgraded to 3.3.1-rc4 but it doesn't seem to make a difference...

Also, 'lsevtchn' shows one extra 'Channel is waiting interdom connection' after starting and then destroying a new domain.

James
Keir Fraser
2009-Jan-01 14:08 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 01/01/2009 13:09, "James Harper" <james.harper@bendigoit.com.au> wrote:

>> Domains don't die, they just stay in the 's' state until you 'xm
>> destroy' them, and even after that there is still a page or two being
>> used according to 'xm debug q'.
>>
>> I have upgraded to 3.3.1-rc4 but it doesn't seem to make a
>> difference...
>
> Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
> connection' after starting and then destroying a new domain.

Backend driver not cleaning up due to xend not correctly deleting a directory from xenstore, or because of hotplug/udev script problems? There's some interaction going on with your dom0 installation, since no one else has seen or reported this issue.

 -- Keir
Venefax
2009-Jan-01 14:34 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
I have the same issue, and Novell technical services is working on a fix. The memory assigned to a domain is not returned to the system after it is killed with "destroy". After several dozen "destroy" and "start" cycles I have to reboot the host. And my host has 128 GB of RAM.

Federico

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
Sent: Thursday, January 01, 2009 9:09 AM
To: James Harper; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] domains not shutting down properly - the problem is back again

On 01/01/2009 13:09, "James Harper" <james.harper@bendigoit.com.au> wrote:

>> Domains don't die, they just stay in the 's' state until you 'xm
>> destroy' them, and even after that there is still a page or two being
>> used according to 'xm debug q'.
>>
>> I have upgraded to 3.3.1-rc4 but it doesn't seem to make a
>> difference...
>
> Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
> connection' after starting and then destroying a new domain.

Backend driver not cleaning up due to xend not correctly deleting a directory from xenstore, or because of hotplug/udev script problems? There's some interaction going on with your dom0 installation, since no one else has seen or reported this issue.

 -- Keir
Keir Fraser
2009-Jan-01 14:41 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
Not releasing any guest memory would possibly be a different bug. James's guests are hanging around with just a few kilobytes of memory remaining.

 K.

On 01/01/2009 14:34, "Venefax" <venefax@gmail.com> wrote:

> I have the same issue. And Novell technical services is working on a fix.
> The memory assigned to a domain is not returned to the system after is
> killed with "destroy". After several dozen "destroy" and "start" I have to
> reboot the host. An my host has 128 GB of ram.
> Federico
>
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
> Sent: Thursday, January 01, 2009 9:09 AM
> To: James Harper; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] domains not shutting down properly - the problem
> is back again
>
> On 01/01/2009 13:09, "James Harper" <james.harper@bendigoit.com.au> wrote:
>
>>> Domains don't die, they just stay in the 's' state until you 'xm
>>> destroy' them, and even after that there is still a page or two being
>>> used according to 'xm debug q'.
>>>
>>> I have upgraded to 3.3.1-rc4 but it doesn't seem to make a
>>> difference...
>>
>> Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
>> connection' after starting and then destroying a new domain.
>
> Backend driver not cleaning up due to xend not correctly deleting a
> directory from xenstore, or because of hotplug/udev script problems? There's
> some interaction going on with your dom0 installation, since no one else has
> seen or reported this issue.
>
> -- Keir
James Harper
2009-Jan-01 23:52 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> On 01/01/2009 13:09, "James Harper" <james.harper@bendigoit.com.au> wrote:
>
> >> Domains don't die, they just stay in the 's' state until you 'xm
> >> destroy' them, and even after that there is still a page or two being
> >> used according to 'xm debug q'.
> >>
> >> I have upgraded to 3.3.1-rc4 but it doesn't seem to make a
> >> difference...
> >
> > Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
> > connection' after starting and then destroying a new domain.
>
> Backend driver not cleaning up due to xend not correctly deleting a
> directory from xenstore, or because of hotplug/udev script problems?
> There's some interaction going on with your dom0 installation, since no
> one else has seen or reported this issue.

When I start the domain, 'udevadm monitor' says:

UDEV  [1230853478.622189] add /devices/xen-backend/vbd-3-769 (xen-backend)
UDEV  [1230853478.684237] add /devices/xen-backend/vbd-3-770 (xen-backend)
UDEV  [1230853478.830864] add /devices/xen-backend/vif-3-0 (xen-backend)
UEVENT[1230853478.835024] add /class/net/vif3.0 (net)
UEVENT[1230853478.837346] online /devices/xen-backend/vif-3-0 (xen-backend)
UDEV  [1230853478.838478] online /devices/xen-backend/vif-3-0 (xen-backend)
UDEV  [1230853478.871352] add /class/net/vif3.0 (net)
UEVENT[1230853479.026088] add /devices/xen-backend/console-3-0 (xen-backend)
UDEV  [1230853479.027654] add /devices/xen-backend/console-3-0 (xen-backend)

Then when I issue the shutdown:

UDEV  [1230853614.593201] offline /devices/xen-backend/vif-3-0 (xen-backend)
UDEV  [1230853614.593239] remove /class/net/vif3.0 (net)

And because the domain is now stuck in the 's' state, I have to issue the 'destroy':

UDEV  [1230853668.181925] remove /devices/xen-backend/console-3-0 (xen-backend)
UDEV  [1230853668.219248] remove /devices/xen-backend/vbd-3-769 (xen-backend)
UDEV  [1230853668.230573] remove /devices/xen-backend/vbd-3-770 (xen-backend)
UDEV  [1230853668.238480] remove /devices/xen-backend/vif-3-0 (xen-backend)

On another machine that does not have this problem, a similar domain (different version of udev) does this on start:

UEVENT[1230853752.615378] add@/devices/xen-backend/vbd-10-769
UDEV  [1230853752.615378] add@/devices/xen-backend/vbd-10-769
UEVENT[1230853752.653987] add@/devices/xen-backend/vbd-10-770
UDEV  [1230853752.653987] add@/devices/xen-backend/vbd-10-770
UEVENT[1230853752.685909] add@/devices/xen-backend/vif-10-0
UDEV  [1230853752.685909] add@/devices/xen-backend/vif-10-0
UEVENT[1230853752.689908] add@/class/net/vif10.0
UDEV  [1230853752.697530] add@/class/net/vif10.0
UEVENT[1230853752.698462] online@/devices/xen-backend/vif-10-0
UDEV  [1230853752.699603] online@/devices/xen-backend/vif-10-0
UEVENT[1230853752.773270] add@/devices/xen-backend/console-10-0
UDEV  [1230853752.781188] add@/devices/xen-backend/console-10-0

And this on shutdown (shutdown was actually done before the start):

UEVENT[1230853735.596056] remove@/devices/xen-backend/console-7-0
UDEV  [1230853735.596056] remove@/devices/xen-backend/console-7-0
UEVENT[1230853735.625884] remove@/devices/xen-backend/vbd-7-769
UDEV  [1230853735.625884] remove@/devices/xen-backend/vbd-7-769
UEVENT[1230853735.641803] remove@/devices/xen-backend/vbd-7-770
UDEV  [1230853735.641803] remove@/devices/xen-backend/vbd-7-770
UEVENT[1230853735.653536] offline@/devices/xen-backend/vif-7-0
UDEV  [1230853735.653536] offline@/devices/xen-backend/vif-7-0
UEVENT[1230853735.764489] remove@/class/net/vif7.0
UDEV  [1230853735.764489] remove@/class/net/vif7.0
UEVENT[1230853736.005804] remove@/devices/xen-backend/vif-7-0
UDEV  [1230853736.135452] remove@/devices/xen-backend/vif-7-0

Does that suggest anything obvious?

Thanks

James
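One way to make the difference between the two hosts easier to see is to compare which xen-backend devices were added at start against which were removed at shutdown. The sketch below is only an illustration: it parses a few of the 'udevadm monitor' lines quoted in this message (affected host, domain 3), and the helper names backend_events and leftover_backends are made up for this example. It assumes the two line layouts seen in the thread ("add /devices/..." and "add@/devices/...").

```python
import re

# UDEV lines copied from the affected host's output in this message (domain 3).
START = """\
UDEV [1230853478.622189] add /devices/xen-backend/vbd-3-769 (xen-backend)
UDEV [1230853478.684237] add /devices/xen-backend/vbd-3-770 (xen-backend)
UDEV [1230853478.830864] add /devices/xen-backend/vif-3-0 (xen-backend)
UDEV [1230853479.027654] add /devices/xen-backend/console-3-0 (xen-backend)
"""
SHUTDOWN = """\
UDEV [1230853614.593201] offline /devices/xen-backend/vif-3-0 (xen-backend)
UDEV [1230853614.593239] remove /class/net/vif3.0 (net)
"""

# Accepts both layouts seen in the thread: "add /devices/..." and "add@/devices/...".
LINE = re.compile(r"^(UDEV|UEVENT)\s*\[[\d.]+\]\s+(\w+)[ @](\S+)")


def backend_events(log, action):
    """Return the xen-backend device names that appear with the given action."""
    found = set()
    for line in log.splitlines():
        m = LINE.match(line)
        if m and m.group(2) == action and "/xen-backend/" in m.group(3):
            found.add(m.group(3).rsplit("/", 1)[1])
    return found


def leftover_backends(start_log, stop_log):
    """Backends that were added at start but never removed in stop_log."""
    return backend_events(start_log, "add") - backend_events(stop_log, "remove")


print(sorted(leftover_backends(START, SHUTDOWN)))
```

On the affected host every backend is still present after the shutdown (only the vif goes offline), whereas the log from the working machine shows all four backends being removed without any 'xm destroy'.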
James Harper
2009-Jan-02 02:38 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> There's some interaction going on with your dom0 installation, since
> no one else has seen or reported this issue.

I have just 'debootstrap'd a new root filesystem for my dom0 (been meaning to do it for a while), but the problem still continues... at least I know now it really isn't anything from the Debian Xen packages.

James
James Harper
2009-Jan-02 03:40 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
> > connection' after starting and then destroying a new domain.
>
> Backend driver not cleaning up due to xend not correctly deleting a
> directory from xenstore, or because of hotplug/udev script problems?
> There's some interaction going on with your dom0 installation, since no
> one else has seen or reported this issue.

When I kill xenstore all the allocated event channels go away, so I'm guessing that's it.

Just need to find out why xenstore isn't getting told to clean up...

James
James Harper
2009-Jan-02 04:15 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > > Also, 'lsevtchn' shows one extra 'Channel is waiting interdom
> > > connection' after starting and then destroying a new domain.
> >
> > Backend driver not cleaning up due to xend not correctly deleting a
> > directory from xenstore, or because of hotplug/udev script problems?
> > There's some interaction going on with your dom0 installation, since
> > no one else has seen or reported this issue.
>
> When I kill xenstore all the allocated event channels go away, so I'm
> guessing that's it.
>
> Just need to find out why xenstore isn't getting told to clean up...

The obvious next step was to go and litter xenstored with syslog messages so I know what it's up to. I did that, shut down all the domains, stopped xend, killed xenstored, restarted xend, and it all works perfectly.

I then rebooted, and it is back to the normal behaviour. I'm now looking at when xend starts in the boot load order... maybe it's loading too early?

James
James Harper
2009-Jan-02 04:48 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> The obvious next step was to go and litter xenstored with syslog messages
> so I know what it's up to. I did that, shut down all the domains, stopped
> xend, killed xenstored, restarted xend, and it all works perfectly.
>
> I then rebooted, and it is back to the normal behaviour. I'm now looking
> at when xend starts in the boot load order... maybe it's loading too
> early?

Pushing it back in the boot process didn't make any difference. I even tried starting it manually after the system had been up for 5 minutes.

The only thing that seems to make a difference is:

/etc/init.d/xend stop
killall xenstored
/etc/init.d/xend start

Once I do that, everything works perfectly. Now I guess I just have to try and find out what is different between starting it on a freshly booted system vs restarting it...

James
James Harper
2009-Jan-02 09:11 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > The obvious next step was to go and litter xenstored with syslog
> > messages so I know what it's up to. I did that, shut down all the
> > domains, stopped xend, killed xenstored, restarted xend, and it all
> > works perfectly.
> >
> > I then rebooted, and it is back to the normal behaviour. I'm now
> > looking at when xend starts in the boot load order... maybe it's
> > loading too early?
>
> Pushing it back in the boot process didn't make any difference. I even
> tried starting it manually after the system had been up for 5 minutes.
>
> The only thing that seems to make a difference is:
>
> /etc/init.d/xend stop
> killall xenstored
> /etc/init.d/xend start
>
> Once I do that, everything works perfectly. Now I guess I just have to
> try and find out what is different between starting it on a freshly
> booted system vs restarting it...

Just to clarify, by 'everything works perfectly' I mean that a domain that doesn't have any vif or vbd interfaces (just a 'kernel=' line for testing) crashes and cleans up after itself, as opposed to crashing and hanging around. Restarting xenstored obviously breaks the connection to the Dom0 backend drivers.

James
Keir Fraser
2009-Jan-02 09:55 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 02/01/2009 09:11, "James Harper" <james.harper@bendigoit.com.au> wrote:

>> The only thing that seems to make a difference is:
>>
>> /etc/init.d/xend stop
>> killall xenstored
>> /etc/init.d/xend start
>>
>> Once I do that, everything works perfectly. Now I guess I just have to
>> try and find out what is different between starting it on a freshly
>> booted system vs restarting it...
>
> Just to clarify, by 'everything works perfectly' I mean that a domain
> that doesn't have any vif or vbd interfaces (just a 'kernel=' line for
> testing), crashes and cleans up after itself, as opposed to crashing and
> hanging around. Restarting xenstored obviously breaks the connection to
> the Dom0 backend drivers.

Yeah, I was going to say... :-)

Anyway, your observations can be explained by the fact that the restarted xenstored will not auto-connect to any domain. So since it holds no resources of the domU, it won't impede the domU's destruction.

Xenstored is supposed to receive a VIRQ_DOM_EXC when a domain is killed (see xen/common/domain.c:domain_kill()). This should trigger xenstored_domain.c:domain_cleanup(), which queries every domain it knows about, and if it sees XEN_DOMINF_dying (which gets cooked in libxenctrl into the boolean flag dominfo.dying) then it should talloc_free() the domain state and hence release resources. These are the paths you need to log.

 -- Keir
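The cleanup pass Keir describes can be pictured with a small model. This is a sketch only, not xenstored's code: the Domain class, its fields, and the resource strings are invented for illustration, and clearing a Python list merely stands in for what talloc_free() does in the real daemon.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Toy stand-in for the per-domain state xenstored keeps."""
    domid: int
    dying: bool = False  # models the dominfo.dying flag
    resources: list = field(
        default_factory=lambda: ["ring page", "event channel"]
    )


def domain_cleanup(known):
    """Release every known domain that is flagged as dying.

    'known' maps domid -> Domain. Returns the ids that were released.
    """
    released = [d.domid for d in known.values() if d.dying]
    for domid in released:
        # Dropping the state is the analogue of talloc_free() here.
        known.pop(domid).resources.clear()
    return released


known = {3: Domain(3, dying=True), 4: Domain(4)}
print(domain_cleanup(known), sorted(known))  # [3] [4]
```

The point of the model is that nothing is released unless the pass actually runs: if the VIRQ_DOM_EXC notification never arrives, a dying domain keeps its last page and event channel, which matches what 'xm debug q' and 'lsevtchn' show in this thread.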
James Harper
2009-Jan-02 10:11 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > Just to clarify, by 'everything works perfectly' I mean that a domain
> > that doesn't have any vif or vbd interfaces (just a 'kernel=' line for
> > testing), crashes and cleans up after itself, as opposed to crashing and
> > hanging around. Restarting xenstored obviously breaks the connection to
> > the Dom0 backend drivers.
>
> Yeah, I was going to say... :-)
>
> Anyway, your observations can be explained by the fact that the restarted
> xenstored will not auto-connect to any domain. So since it holds no
> resources of the domU, it won't impede the domU's destruction.

Even a domain I create subsequent to restarting xenstored & xend? 'xm console' doesn't work in that case so I'm guessing not.

> Xenstored is supposed to receive a VIRQ_DOM_EXC when a domain is killed
> (see xen/common/domain.c:domain_kill(). This should trigger
> xenstored_domain.c:domain_cleanup() which queries every domain it knows
> about and if it sees XEN_DOMINF_dying (which gets cooked in libxenctrl
> into boolean flag dominfo.dying) then it should talloc_free() the domain
> state and hence release resources. These are the paths you need to log.

I have previously logged xenstored_domain.c:domain_cleanup() - it never gets called during the domain crashing or being shut down. I think the action of creating another domain (or an explicit 'xm destroy') results in domain_cleanup() getting called somewhere along the way, which mostly cleans up the domain but obviously leaves a few pages and an event channel lying around (as revealed by 'xm debug q' and 'lsevtchn').

I guess I'll start adding some logs to domain.c...

Thanks

James
James Harper
2009-Jan-02 10:28 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> Xenstored is supposed to receive a VIRQ_DOM_EXC when a domain is killed
> (see xen/common/domain.c:domain_kill()

Just added some more logging - domain_kill is never called either until I explicitly say 'xm destroy'.

James
Keir Fraser
2009-Jan-02 10:28 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 02/01/2009 10:11, "James Harper" <james.harper@bendigoit.com.au> wrote:

>> xenstored will not auto-connect to any domain. So since it holds no
>> resources of the domU, it won't impede the domU's destruction.
>
> Even a domain I create subsequent to restarting xenstored & xend? 'xm
> console' doesn't work in that case so I'm guessing not.

No. I'd be a bit surprised if you could create a domain without a working dom0 ring connection to xenstored though.

> I have previously logged xenstored_domain.c:domain_cleanup() - it never
> gets called during the domain crashing or being shut down. I think the
> action of creating another domain (or an explicit 'xm destroy') results
> in domain_cleanup() getting called somewhere along the way, which mostly
> cleans up the domain but obviously leaves a few pages and an event
> channel lying around (as revealed by 'xm debug q' and 'lsevtchn').

As long as domain_cleanup() gets called at some point it should see the dying domU has dominfo.dying and then release resources.

> I guess I'll start adding some logs to domain.c...

Good idea. Maybe something is going wrong in domain_kill(). That will be called by the DOMCTL_destroydomain hypercall, which should be triggered by 'xm destroy'. Note the hypercall is preemptible, requiring a loop on EAGAIN in libxenctrl to make sure it completes its work. The notification on VIRQ_DOM_EXC is near the end of the function.

 -- Keir
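The "loop on EAGAIN" that a preemptible hypercall needs can be sketched as a plain retry loop. This is a model under stated assumptions, not libxenctrl's code: destroy_domain and make_fake_hypercall are hypothetical names, the fake hypercall simply reports EAGAIN a fixed number of times, and returning a negative errno value is a convention chosen for this sketch only.

```python
import errno


def destroy_domain(domid, hypercall, max_tries=1000):
    """Call hypercall(domid) until it stops reporting -EAGAIN.

    Returns (return_code, number_of_calls_made).
    """
    for attempt in range(1, max_tries + 1):
        rc = hypercall(domid)
        if rc != -errno.EAGAIN:
            return rc, attempt
    raise RuntimeError(
        "domain %d not destroyed after %d tries" % (domid, max_tries)
    )


def make_fake_hypercall(preemptions):
    """Hypothetical stand-in: preempted 'preemptions' times, then succeeds."""
    state = {"left": preemptions}

    def hypercall(domid):
        if state["left"] > 0:
            state["left"] -= 1
            return -errno.EAGAIN  # work was interrupted, caller must retry
        return 0                  # teardown finished

    return hypercall


print(destroy_domain(3, make_fake_hypercall(2)))  # (0, 3)
```

If the caller gave up after the first EAGAIN, the teardown would stop partway, and since the VIRQ_DOM_EXC notification is sent near the end of domain_kill(), it would never be raised.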
Keir Fraser
2009-Jan-02 10:41 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 02/01/2009 10:28, "James Harper" <james.harper@bendigoit.com.au> wrote:

>> Xenstored is supposed to receive a VIRQ_DOM_EXC when a domain is killed
>> (see xen/common/domain.c:domain_kill()
>
> Just added some more logging - domain_kill is never called either until
> I explicitly say 'xm destroy'

That's expected. But you don't expect a domain to disappear until you do 'xm destroy', unless you have on_{poweroff,destroy,crash} = destroy in your domain config file. In which case the call to domain_kill() should be made automatically by xend.

Your problem is the domain doesn't disappear even after explicitly doing 'xm destroy', right? That's the first thing to track down.

 -- Keir
James Harper
2009-Jan-02 10:48 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> On 02/01/2009 10:28, "James Harper" <james.harper@bendigoit.com.au> wrote:
>
> >> Xenstored is supposed to receive a VIRQ_DOM_EXC when a domain is
> >> killed (see xen/common/domain.c:domain_kill()
> >
> > Just added some more logging - domain_kill is never called either until
> > I explicitly say 'xm destroy'
>
> That's expected. But you don't expect a domain to disappear until you do
> 'xm destroy', unless you have on_{poweroff,destroy,crash} = destroy in your
> domain config file. In which case the call to domain_kill() should be made
> automatically by xend.
>
> Your problem is the domain doesn't disappear even after explicitly doing
> 'xm destroy', right? That's the first thing to track down.

After an 'xm destroy', the domain no longer shows up in 'xm list', but there is evidence of it still holding resources in 'xm debug q'.

The domain in question is purely a kernel. There is no initrd, no vifs, and no vbds. It should start, crash, then disappear (on_crash 'destroy'). The last bit doesn't happen though - it just stays in 'xm list' until I 'xm destroy' it.

My suspicion is that the procedure that should happen automatically when the domain crashes is hanging somewhere - an 'xm destroy' makes the domain (mostly) go away, but because the orderly shutdown didn't happen, resources are left, and don't go away until I kill xenstore.

James
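For readers who want to reproduce this, a kernel-only test guest of the kind James describes might be configured roughly as follows. xm domain config files use Python syntax, so the fragment is written as such. Only the on_crash setting comes from the thread; the name, kernel path, and memory size are placeholders assumed for illustration.

```python
# Hypothetical xm domain config for a kernel-only test guest.
# Everything except on_crash is a placeholder, not taken from the thread.
name = "crashtest"              # assumed domain name
kernel = "/boot/vmlinuz-test"   # placeholder path to a guest kernel
memory = 32                     # assumed size in MiB

# No ramdisk, vif or disk entries: the guest has no root filesystem,
# so it is expected to crash shortly after it boots.
on_crash = "destroy"            # as stated in the message above
```

With a config like this the guest should appear in 'xm list', crash, and then vanish on its own; on the affected host it instead stays listed until 'xm destroy' is issued.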
James Harper
2009-Jan-02 11:15 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > Just added some more logging - domain_kill is never called either until
> > I explicitly say 'xm destroy'
>
> That's expected. But you don't expect a domain to disappear until you do
> 'xm destroy', unless you have on_{poweroff,destroy,crash} = destroy in your
> domain config file. In which case the call to domain_kill() should be made
> automatically by xend.
>
> Your problem is the domain doesn't disappear even after explicitly doing
> 'xm destroy', right? That's the first thing to track down.

I just ran xend with 'xend start_trace' and wow... a lot of logging! :)

When I start another domain (secondary smtp server running linux), and then issue a shutdown in it, nothing gets logged at all in the xend trace. Should 'xend start_trace' log some activity if a domain dies, if all was working well? (e.g. would xend be the process to do the cleanup etc., or does that all happen in xenstored?)

The domain just sticks in the 's' state until I 'xm destroy' it, then it disappears from 'xm list' but is still using resources in 'xm debug q'.

James
James Harper
2009-Jan-02 11:47 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
I've added even more logging, but still don't appear to be any closer to figuring out what's going on...

I start my smtp server, wait for it to finish booting, then give it a 'xm shutdown'. It does its orderly shutdown and then I see:

do_sched_op gets called with SCHEDOP_shutdown
domain_shutdown gets called
__domain_finalise_shutdown gets called
send_guest_global_virq(dom0, VIRQ_DOM_EXC) gets called

Then nothing. Nothing in xend.log. What should happen next? Should the domain get destroyed before the backend stuff gets cleaned up, or is it the other way around?

I forgot to make xenstored trace so I'll run that again and see what that tells me.

James
Keir Fraser
2009-Jan-02 12:34 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 02/01/2009 11:47, "James Harper" <james.harper@bendigoit.com.au> wrote:

> send_guest_global_virq(dom0, VIRQ_DOM_EXC) gets called
>
> Then nothing. Nothing in xend.log. What should happen next? Should the
> domain get destroyed before the backend stuff gets cleaned up, or is it
> the other way around?

I think xenstored should get kicked, run domain_cleanup() and that should cause the @releaseDomain watch to fire, which should kick xend into doing whatever it does when a domain shuts down. If it is configured to 'destroy' this domain on domain shutdown then indeed it should tell Xen to run domain_kill() and it should notify backends to tear down.

 -- Keir

> I forgot to make xenstored trace so I'll run that again and see what
> that tells me
James Harper
2009-Jan-02 12:52 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> I've added even more logging, but still don't appear to be any closer to
> figuring out what's going on...
>
> I start my smtp server, wait for it to finish booting, then give it a
> 'xm shutdown'. It does its orderly shutdown and then I see:
>
> do_sched_op gets called with SCHEDOP_shutdown
> domain_shutdown gets called
> __domain_finalise_shutdown gets called
> send_guest_global_virq(dom0, VIRQ_DOM_EXC) gets called
>
> Then nothing. Nothing in xend.log. What should happen next? Should the
> domain get destroyed before the backend stuff gets cleaned up, or is it
> the other way around?
>
> I forgot to make xenstored trace so I'll run that again and see what
> that tells me
>
> James

I would expect that upon xen doing "send_guest_global_virq(dom0, VIRQ_DOM_EXC)", xenstored would get an event on the port that it previously bound to VIRQ_DOM_EXC, but this isn't happening...

When this code executes:

    if ((rc = xc_evtchn_bind_virq(xce_handle, VIRQ_DOM_EXC)) == -1)
        barf_perror("Failed to bind to domain exception virq port");
    virq_port = rc;

virq_port is set to 18.

handle_event only ever sees ports 17 (often) and 4 (seldom), never 18... sure enough, if I remove the 'if (port == virq_port)' in 'handle_event' and make it always call domain_cleanup then everything works as it should, but obviously something is really wrong...

Despite what you said about restarting xenstored, if I do restart it, the VIRQ_DOM_EXC signalling from xen to Dom0 works correctly... curious...

James
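The effect of the 'if (port == virq_port)' check on the ports James reports can be shown with a toy dispatcher. This is only an illustration of the filtering, not xenstored's handle_event: the function name dispatch is made up, and the order of the observed ports is invented (the message only says 17 turns up often and 4 seldom).

```python
def dispatch(ports, virq_port):
    """Split observed event ports into cleanup triggers and everything else.

    Only an event arriving on virq_port would lead to domain_cleanup().
    """
    cleanup = [p for p in ports if p == virq_port]
    other = [p for p in ports if p != virq_port]
    return cleanup, other


# Illustrative sequence only: 17 often, 4 seldom, 18 never.
observed = [17, 17, 4, 17]
cleanup, other = dispatch(observed, virq_port=18)
print(len(cleanup), len(other))  # 0 4
```

With port 18 never showing up, the cleanup branch is never taken, which is why removing the check (so that every event triggers domain_cleanup) hides the symptom without explaining where the VIRQ_DOM_EXC event went.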
James Harper
2009-Jan-02 13:27 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > send_guest_global_virq(dom0, VIRQ_DOM_EXC) gets called
> >
> > Then nothing. Nothing in xend.log. What should happen next? Should the
> > domain get destroyed before the backend stuff gets cleaned up, or is it
> > the other way around?
>
> I think xenstored should get kicked, run domain_cleanup() and that should
> cause @releaseDomain watch to fire, which should kick xend into doing
> whatever it does when a domain shuts down. If it is configured to
> 'destroy' this domain on domain shutdown then indeed it should tell Xen
> to run domain_kill() and it should notify backends to tear down.

As per a subsequent email, 'send_guest_global_virq(dom0, VIRQ_DOM_EXC)' never reaches Dom0 (unless xenstored is restarted, then a whole lot of other stuff doesn't work but the send_guest_global_virq actually does work).

Debugging in xenstored shows that port 18 was returned from the kernel. Debugging in xen shows that port 18 is definitely what the hypervisor allocated.

I added some debugging to the send_guest_global_virq code path too, but as soon as the DomU I created for testing shut down, my Dom0 became unresponsive so I suspect I have botched my debug statements. It's nearly 12:30am here and the server is at work so I'm done for the day.

The most frustrating thing is that I just know it's going to be some stupid little configuration thing on my server causing all of this!!!

James
Keir Fraser
2009-Jan-02 15:31 UTC
Re: [Xen-devel] domains not shutting down properly - the problem is back again
On 02/01/2009 13:27, "James Harper" <james.harper@bendigoit.com.au> wrote:

> I added some debugging to the send_guest_global_virq code path too, but
> as soon as the DomU I created for testing shut down, my Dom0 became
> unresponsive so I suspect I have botched my debug statements. It's
> nearly 12:30am here and the server is at work so I'm done for the day.
>
> The most frustrating thing is that I just know it's going to be some
> stupid little configuration thing on my server causing all of this!!!

Perhaps a multiprocessor dom0, plus a bug in the dom0 kernel which means that the VCPU which Xen notifies for the virq is not the one on which the dom0 kernel is expecting to receive the notification? What do you use as dom0 kernel?

 -- Keir
James Harper
2009-Jan-03 00:57 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> On 02/01/2009 13:27, "James Harper" <james.harper@bendigoit.com.au> wrote:
>
> > I added some debugging to the send_guest_global_virq code path too, but
> > as soon as the DomU I created for testing shut down, my Dom0 became
> > unresponsive so I suspect I have botched my debug statements. It's
> > nearly 12:30am here and the server is at work so I'm done for the day.
> >
> > The most frustrating thing is that I just know it's going to be some
> > stupid little configuration thing on my server causing all of this!!!
>
> Perhaps multiprocessor dom0, plus a bug in the dom0 kernel which means
> that the VCPU which Xen notifies for the virq is not the one which dom0
> kernel is expecting to receive the notification to? What do you use as
> dom0 kernel?

My dom0 kernel is http://xenbits.xensource.com/linux-2.6.18-xen.hg

The crash I had last night was just:

    void send_guest_global_virq(struct domain *d, int virq)
    {
        unsigned long flags;
        int port;
        struct vcpu *v;
        struct evtchn *chn;

        ASSERT(virq_is_global(virq));

        if (virq == VIRQ_DOM_EXC)
            printk("send_guest_global_virq\n");

adding the last two lines above. There should be no reason for that to cause a crash that I can see, but as soon as I take those lines away it all works. I have printk's in the function that calls send_guest_global_virq and they work just fine.

Is my version of gcc a problem: "gcc (Debian 4.3.2-1) 4.3.2"? When things like adding (seemingly) benign print statements start causing crashes (or making crashes go away :), I start to get suspicious that there is something a whole lot different going on...

James
James Harper
2009-Jan-03 02:25 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> My dom0 kernel is http://xenbits.xensource.com/linux-2.6.18-xen.hg
>
> The crash I had last night was just:
>
> "
> void send_guest_global_virq(struct domain *d, int virq)
> {
>     unsigned long flags;
>     int port;
>     struct vcpu *v;
>     struct evtchn *chn;
>
>     ASSERT(virq_is_global(virq));
>
>     if (virq == VIRQ_DOM_EXC)
>         printk("send_guest_global_virq\n");
> "
>
> adding the last two lines above. There should be no reason for that to
> cause a crash that I can see, but as soon as I take those lines away it
> all works. I have printk's in the function that calls
> send_guest_global_virq and they work just fine.

It turns out it wasn't those two lines, it was a debug statement a bit further down, just after "port = v->virq_to_evtchn[virq];", with v->virq_lock held. I'm guessing that printk can't execute with virq_lock held?

So no crash anymore but the event isn't getting through...

James
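Whether printk is actually unsafe under virq_lock is only a guess in this message and is not confirmed anywhere in the thread. Independent of that, a common way to keep debug output out of a critical section is to copy the value while the lock is held and print it after the lock is dropped. The following is a user-space sketch of that pattern only, not Xen code: the lock, the table, its size and contents, and every `demo_` name are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

#define DEMO_NR_VIRQS 4   /* hypothetical table size */

/* Hypothetical stand-in for v->virq_lock: a trivial spinlock. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

/* Hypothetical stand-in for v->virq_to_evtchn[]; the values are made up. */
static int demo_virq_to_evtchn[DEMO_NR_VIRQS] = { 0, 0, 0, 18 };

static void demo_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&demo_lock)) {
        /* spin until the flag is clear */
    }
}

static void demo_lock_release(void)
{
    atomic_flag_clear(&demo_lock);
}

/* Look up the port for a virq and log it, printing only after unlocking. */
int lookup_port_and_log(int virq)
{
    int port;

    assert(virq >= 0 && virq < DEMO_NR_VIRQS);

    demo_lock_acquire();
    port = demo_virq_to_evtchn[virq];   /* only copy the value while locked */
    demo_lock_release();

    /* The potentially slow call runs outside the critical section. */
    printf("virq=%d port=%d\n", virq, port);
    return port;
}
```

The trade-off is that the printed value is a snapshot: by the time it is printed, the table may already have changed.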
James Harper
2009-Jan-03 05:12 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> On 02/01/2009 13:27, "James Harper" <james.harper@bendigoit.com.au> wrote:
> > I added some debugging to the send_guest_global_virq code path too, but
> > as soon as the DomU I created for testing shut down, my Dom0 became
> > unresponsive so I suspect I have botched my debug statements. It's
> > nearly 12:30am here and the server is at work so I'm done for the day.
> >
> > The most frustrating thing is that I just know it's going to be some
> > stupid little configuration thing on my server causing all of this!!!
>
> Perhaps multiprocessor dom0, plus a bug in the dom0 kernel which means
> that the VCPU which Xen notifies for the virq is not the one which dom0
> kernel is expecting to receive the notification to? What do you use as
> dom0 kernel?

That appears to be the problem.

1. xenstore starts up and binds VIRQ_DOM_EXC to port 18
2. xend starts and sets the number of cpus to 1 (dom0-cpus = 1)
3. xen notifies xenstore on port=18, vcpu=1, but vcpu 1 doesn't exist anymore so the event never gets anywhere

The curious thing is that IOCTL_EVTCHN_BIND_VIRQ explicitly sets vcpu 0, so why is the event getting delivered to vcpu 1???

Thinking back, this bug must have reappeared once I changed dom0-cpus from 0 to 1... wish I had clicked back then :(

James
James Harper
2009-Jan-03 05:36 UTC
RE: [Xen-devel] domains not shutting down properly - the problem is back again
> > Perhaps multiprocessor dom0, plus a bug in the dom0 kernel which means
> > that the VCPU which Xen notifies for the virq is not the one which dom0
> > kernel is expecting to receive the notification to? What do you use as
> > dom0 kernel?
>
> That appears to be the problem.
>
> 1. xenstore starts up and binds VIRQ_DOM_EXC to port 18
> 2. xend starts and sets the number of cpus to 1 (dom0-cpus = 1)
> 3. xen notifies xenstore on port=18, vcpu=1, but vcpu 1 doesn't exist
>    anymore so the event never gets anywhere
>
> The curious thing is that IOCTL_EVTCHN_BIND_VIRQ explicitly sets vcpu
> 0, so why is the event getting delivered to vcpu 1???

Something is making a call to evtchn_bind_vcpu...

James
James Harper
2009-Jan-03 06:29 UTC
[Xen-devel] bug in evtchn_cpu_notify (was domains not shutting down properly - the problem is back again)
> > > Perhaps multiprocessor dom0, plus a bug in the dom0 kernel which means
> > > that the VCPU which Xen notifies for the virq is not the one which dom0
> > > kernel is expecting to receive the notification to? What do you use as
> > > dom0 kernel?
> >
> > That appears to be the problem.
> >
> > 1. xenstore starts up and binds VIRQ_DOM_EXC to port 18
> > 2. xend starts and sets the number of cpus to 1 (dom0-cpus = 1)
> > 3. xen notifies xenstore on port=18, vcpu=1, but vcpu 1 doesn't exist
> >    anymore so the event never gets anywhere
> >
> > The curious thing is that IOCTL_EVTCHN_BIND_VIRQ explicitly sets vcpu
> > 0, so why is the event getting delivered to vcpu 1???
>
> Something is making a call to evtchn_bind_vcpu...

I think I've figured out what is going on... the 'per user data' in drivers/xen/evtchn/evtchn.c is per connection to the event channel device, so the same 'per user data' may be assigned to multiple ports.

Initially all the event channels opened by xenstored (e.g. 17 and 18) have '1' in the vcpu of their user data, indicating that ports on that connection are bound to vcpu 1.

In evtchn_cpu_notify(CPU_DOWN_PREPARE) (when xend starts, reducing the number of cpus in dom0 to 1), every port is looped through. Port 17 is found to be bound to vcpu 1 (via the per user data) which is about to go away, so the port is rebound to vcpu 0 and the user data is updated to reflect the new vcpu (I only have 2 cpus, so it is set to 0 as 1 is going away). Port 18 is checked, but because the per-user data has already been updated to vcpu=0, nothing is done and the port stays bound to vcpu 1.

I'll try and come up with a solution when I get back to my computer in a few hours if nobody beats me to it... is there another way to check what vcpu a port is bound to than checking the per-user value of bind_vcpu?

James
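The sequence described in the message above can be reproduced in a small user-space model. This is a sketch under stated assumptions, not the driver code: the `per_user_data`, `port_user` and `bind_cpu` names are taken from the message, while `DEMO_NR_PORTS`, `port_vcpu`, `demo_rebind`, the simplified loop, and the two function names are hypothetical stand-ins. Ports 17 and 18 share one connection and both start on vcpu 1, as in the report.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_PORTS 32   /* hypothetical; stands in for NR_EVENT_CHANNELS */

/* One instance per open connection to the device, shared by all its ports. */
struct per_user_data { int bind_cpu; };

static struct per_user_data *port_user[DEMO_NR_PORTS]; /* port -> connection */
static int port_vcpu[DEMO_NR_PORTS];  /* vcpu each port is really bound to */

/* Hypothetical stand-in for rebind_evtchn_to_cpu(): records the binding. */
static void demo_rebind(int port, int cpu)
{
    assert(port >= 0 && port < DEMO_NR_PORTS);
    port_vcpu[port] = cpu;
}

/* The loop as described: bind_cpu is rewritten after the first match. */
static void cpu_down_prepare_buggy(int hotcpu, int newcpu)
{
    for (int port = 0; port < DEMO_NR_PORTS; port++) {
        struct per_user_data *u = port_user[port];
        if (u != NULL && u->bind_cpu == hotcpu) {
            demo_rebind(port, newcpu);
            u->bind_cpu = newcpu;  /* later ports of u no longer match */
        }
    }
}

/* Returns how many in-use ports are still bound to vcpu 1 afterwards. */
int stranded_ports_after_buggy_notify(void)
{
    static struct per_user_data xenstored;
    int stranded = 0;

    for (int port = 0; port < DEMO_NR_PORTS; port++) {
        port_user[port] = NULL;
        port_vcpu[port] = 0;
    }
    xenstored.bind_cpu = 1;
    port_user[17] = port_user[18] = &xenstored;
    port_vcpu[17] = port_vcpu[18] = 1;

    cpu_down_prepare_buggy(1, 0);   /* vcpu 1 goes away, vcpu 0 remains */

    for (int port = 0; port < DEMO_NR_PORTS; port++)
        if (port_user[port] != NULL && port_vcpu[port] == 1)
            stranded++;
    return stranded;
}
```

In this model port 17 is moved to vcpu 0 and port 18 is left behind on vcpu 1, which matches the symptom of the VIRQ_DOM_EXC event on port 18 never arriving.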
James Harper
2009-Jan-03 07:04 UTC
[Xen-devel] RE: bug in evtchn_cpu_notify (was domains not shutting down properly - the problem is back again)
> I think I've figured out what is going on... the 'per user data' in
> drivers/xen/evtchn/evtchn.c is per connection to the event channel device,
> so the same 'per user data' may be assigned to multiple ports
>
> Initially all the event channels opened by xenstored (eg 17 and 18) have
> '1' in the vcpu of their user data, indicating that ports on that
> connection are bound to vcpu 1.
>
> In evtchn_cpu_notify(CPU_DOWN_PREPARE) (when xend starts, reducing the
> number of cpu's in dom0 to 1), every port is looped through. Port 17 is
> found to be bound to vcpu 1 (via the per user data) which is about to go
> away, so the port is rebound to vcpu 0 and the user data is updated to
> reflect the new vcpu (I only have 2 cpu's, so it is set to 0 as 1 is going
> away). Port 18 is checked but because the per-user data has been updated
> to vcpu=0 so nothing is done and the port stays bound to vcpu 1.
>
> I'll try and come up with a solution when I get back to my computer in a
> few hours if nobody beats me to it... is there another way to check what
> vcpu a port is bound to than checking the per-user value of bind_vcpu?

This patch fixes the problem for me... it doesn't require any new data structures and evtchn_cpu_notify is hardly a performance critical code path so I think we can wear the extra looping...
diff -r 618fc299e2f1 drivers/xen/evtchn/evtchn.c
--- a/drivers/xen/evtchn/evtchn.c	Thu Dec 18 11:51:36 2008 +0000
+++ b/drivers/xen/evtchn/evtchn.c	Sat Jan 03 18:01:04 2009 +1100
@@ -497,7 +497,7 @@
 {
 	int hotcpu = (unsigned long)hcpu;
 	cpumask_t map = cpu_online_map;
-	int port, newcpu;
+	int port, port2, newcpu;
 	struct per_user_data *u;
 
 	switch (action) {
@@ -508,7 +508,9 @@
 			if ((u = port_user[port]) != NULL &&
 			    u->bind_cpu == hotcpu &&
 			    (newcpu = next_bind_cpu(map)) < NR_CPUS) {
-				rebind_evtchn_to_cpu(port, newcpu);
+				for (port2 = port; port2 < NR_EVENT_CHANNELS; port2++)
+					if (port_user[port2] == u)
+						rebind_evtchn_to_cpu(port2, newcpu);
 				u->bind_cpu = newcpu;
 			}
 		}
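The effect of the patch can be checked in a small user-space model. This is a sketch under stated assumptions rather than the driver itself: the inner `port2` loop mirrors the patch, and `per_user_data`, `port_user` and `bind_cpu` are the names it uses, but `DEMO_NR_PORTS`, `port_vcpu`, `demo_rebind` and the two function names are hypothetical stand-ins, and locking and next_bind_cpu() are left out. Two ports (17 and 18) share one connection and both start on vcpu 1.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_PORTS 32   /* hypothetical; stands in for NR_EVENT_CHANNELS */

/* One instance per open connection to the device, shared by all its ports. */
struct per_user_data { int bind_cpu; };

static struct per_user_data *port_user[DEMO_NR_PORTS]; /* port -> connection */
static int port_vcpu[DEMO_NR_PORTS];  /* vcpu each port is really bound to */

/* Hypothetical stand-in for rebind_evtchn_to_cpu(): records the binding. */
static void demo_rebind(int port, int cpu)
{
    assert(port >= 0 && port < DEMO_NR_PORTS);
    port_vcpu[port] = cpu;
}

/* Patched loop: move every port of the connection before updating bind_cpu. */
static void cpu_down_prepare_fixed(int hotcpu, int newcpu)
{
    for (int port = 0; port < DEMO_NR_PORTS; port++) {
        struct per_user_data *u = port_user[port];
        if (u != NULL && u->bind_cpu == hotcpu) {
            /* Ports below 'port' cannot belong to u, so start at 'port'. */
            for (int port2 = port; port2 < DEMO_NR_PORTS; port2++)
                if (port_user[port2] == u)
                    demo_rebind(port2, newcpu);
            u->bind_cpu = newcpu;
        }
    }
}

/* Returns how many in-use ports are still bound to vcpu 1 afterwards. */
int stranded_ports_after_fixed_notify(void)
{
    static struct per_user_data xenstored;
    int stranded = 0;

    for (int port = 0; port < DEMO_NR_PORTS; port++) {
        port_user[port] = NULL;
        port_vcpu[port] = 0;
    }
    xenstored.bind_cpu = 1;
    port_user[17] = port_user[18] = &xenstored;
    port_vcpu[17] = port_vcpu[18] = 1;

    cpu_down_prepare_fixed(1, 0);   /* vcpu 1 goes away, vcpu 0 remains */

    for (int port = 0; port < DEMO_NR_PORTS; port++)
        if (port_user[port] != NULL && port_vcpu[port] == 1)
            stranded++;
    return stranded;
}
```

The inner scan makes the notifier quadratic in the number of ports in the worst case, which is the "extra looping" the message accepts because CPU hot-unplug is rare.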
Keir Fraser
2009-Jan-03 08:52 UTC
[Xen-devel] Re: bug in evtchn_cpu_notify (was domains not shutting down properly - the problem is back again)
On 03/01/2009 07:04, "James Harper" <james.harper@bendigoit.com.au> wrote:

> This patch fixes the problem for me... it doesn't require any new data
> structures and evtchn_cpu_notify is hardly a performance critical code
> path so I think we can wear the extra looping...

Looks good. Thanks James! A real bug after all. :-)

-- Keir