Hi folks,

I have two machines running Xen; one has v4.1.1 and the other v4.1.2.
On top I have a DRBD block device on which my test domU runs (Slackware v13.37 using a kernel.org kernel, version 3.2.6).

The domU runs fine, but when I migrate it to the other dom0 something weird happens:
At first everything seems fine: the domain is gone on host 1 and runs on host 2 (an ssh session to the domU survives and works fine as well), but after a minute or so the domain is suddenly completely gone! (No corpse is left on either dom0.)

The logs show a normal shutdown as far as I can tell; the only hint as to why it shut down is in 'xm dmesg' on the migration target dom0:
(XEN) Watchdog timer fired for domain 19

Does anyone know what is happening here and why the migration triggers this? (And possible solutions?)
Is this a bug in Xen or did I do something stupid?

In the domU a /dev/watchdog exists, but I never touch it myself.

Thanks for reading :)

Regards,

Wouter.

On Thu, 2012-03-15 at 15:03 +0000, Wouter de Geus wrote:
> In the domU a /dev/watchdog exists, but I never touch it myself.

Does your distro start a watchdogd for you?

Even if it did then it should obviously continue to poke the watchdog even after a migration, but it might provide some hints.

Ian.

* Ian Campbell <Ian.Campbell@citrix.com> [2012-03-16 10:55:34 +0000]:
> > In the domU a /dev/watchdog exists, but I never touch it myself.
>
> Does your distro start a watchdogd for you?
>
> Even if it did then it should obviously continue to poke the watchdog
> even after a migration, but it might provide some hints.

Nope, it doesn't.
As an experiment I tried sending a 'V' to /dev/watchdog just after migration, which seems to be the magic byte to stop the watchdog according to the Linux kernel docs (Documentation/watchdog/watchdog-api.txt):

"If a driver supports "Magic Close", the driver will not disable the watchdog unless a specific magic character 'V' has been sent to /dev/watchdog just before closing the file."

Surprisingly, the domU seemed to stay alive after doing this.
So it would appear that somehow the watchdog is triggered after migration.
No idea why though.

Wouter.

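For reference, the magic-close experiment described above can be sketched in C roughly as follows. This is a minimal sketch, not code from the thread: `watchdog_magic_close` is a hypothetical helper name, and the path is a parameter only so the helper can be pointed at an ordinary file when experimenting. It assumes the driver supports magic close and that the "nowayout" option is not set; on most drivers, opening the device arms the watchdog before the write disarms it again.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: disarm a Linux watchdog through "magic close".
 * Writes the single character 'V' and then closes the device.
 * Returns 0 on success and -1 on any error.
 * `path` would normally be "/dev/watchdog".
 */
int watchdog_magic_close(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* The magic character must be the last byte written before close. */
	if (write(fd, "V", 1) != 1) {
		close(fd);
		return -1;
	}

	return close(fd);
}
```

Calling `watchdog_magic_close("/dev/watchdog")` as root inside the domU corresponds to the experiment above; the same effect is usually available from a shell by redirecting a single `V` into the device.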
I think this is a bug, moving to xen-devel@ and CCing the driver's author (Hi Jan).

On Fri, 2012-03-16 at 12:29 +0000, Wouter de Geus wrote:
> As an experiment I tried sending a 'V' to /dev/watchdog just after
> migration, which seems to be the magic byte to stop the watchdog
> according to the Linux kernel docs.
>
> Surprisingly, the domU seemed to stay alive after doing this.
> So it would appear that somehow the watchdog is triggered after migration.
> No idea why though.

drivers/watchdog/xen_wdt.c:xen_wdt_resume() unconditionally calls xen_wdt_start(). Shouldn't it only do this if the wdt is active?

Ian.

>>> On 16.03.12 at 13:32, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> drivers/watchdog/xen_wdt.c:xen_wdt_resume() unconditionally calls
> xen_wdt_start(). Shouldn't it only do this if the wdt is active?

Oh, yes, of course. I don't really recall what other driver I used as skeleton, but when adjusting the bits for Xen I obviously screwed this up. Will get a fix out soon.

Jan

>>> On 16.03.12 at 13:32, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> drivers/watchdog/xen_wdt.c:xen_wdt_resume() unconditionally calls
> xen_wdt_start(). Shouldn't it only do this if the wdt is active?

Could you give the patch below a try?

Jan

--- a/drivers/watchdog/xen_wdt.c
+++ b/drivers/watchdog/xen_wdt.c
@@ -297,11 +297,19 @@ static void xen_wdt_shutdown(struct plat
 
 static int xen_wdt_suspend(struct platform_device *dev, pm_message_t state)
 {
-	return xen_wdt_stop();
+	typeof(wdt.id) id = wdt.id;
+	int rc = xen_wdt_stop();
+
+	wdt.id = id;
+
+	return rc;
 }
 
 static int xen_wdt_resume(struct platform_device *dev)
 {
+	if (!wdt.id)
+		return 0;
+	wdt.id = 0;
 	return xen_wdt_start();
 }

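The effect of the patch can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct wdt_model`, `hv_armed`, and all `model_*` functions are hypothetical stand-ins, not the real xen_wdt code or the hypercall interface. It compares an unconditional resume with the patched suspend/resume pair for a guest that never opened /dev/watchdog and for one with a running watchdog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver plus hypervisor state. */
struct wdt_model {
	unsigned int id;   /* non-zero while a hypervisor timer is held */
	bool hv_armed;     /* the hypervisor would fire on a missed kick */
};

static unsigned int next_id = 1;

/* Model of starting the watchdog: a fresh timer is allocated and armed. */
int model_start(struct wdt_model *w)
{
	w->id = next_id++;
	w->hv_armed = true;
	return 0;
}

/* Model of stopping the watchdog: the timer is released. */
int model_stop(struct wdt_model *w)
{
	w->id = 0;
	w->hv_armed = false;
	return 0;
}

/* Resume before the patch: always starts, even if nothing was running. */
int model_resume_unpatched(struct wdt_model *w)
{
	return model_start(w);
}

/* Suspend after the patch: stop, but keep the old id as a marker. */
int model_suspend_patched(struct wdt_model *w)
{
	unsigned int id = w->id;
	int rc = model_stop(w);

	w->id = id;
	return rc;
}

/* Resume after the patch: restart only if a timer was held at suspend. */
int model_resume_patched(struct wdt_model *w)
{
	if (!w->id)
		return 0;
	w->id = 0;
	return model_start(w);
}

/* Walk through the idle-guest case from the original report. */
void model_selfcheck(void)
{
	struct wdt_model old_idle = { 0, false };
	struct wdt_model new_idle = { 0, false };

	model_resume_unpatched(&old_idle);
	assert(old_idle.hv_armed);      /* armed with nobody kicking it */

	model_suspend_patched(&new_idle);
	model_resume_patched(&new_idle);
	assert(!new_idle.hv_armed);     /* stays idle across migration */
}
```

In the model, the unpatched resume leaves an idle guest with an armed timer that nothing kicks, which matches the reported shutdown about a minute after migration; the patched pair leaves an idle guest idle and re-arms a guest whose watchdog was running.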
On Fri, 2012-03-16 at 14:34 +0000, Jan Beulich wrote:
> Could you give the patch below a try?
>
> Jan
>
> --- a/drivers/watchdog/xen_wdt.c
> +++ b/drivers/watchdog/xen_wdt.c
> @@ -297,11 +297,19 @@ static void xen_wdt_shutdown(struct plat
>
>  static int xen_wdt_suspend(struct platform_device *dev, pm_message_t state)
>  {
> -	return xen_wdt_stop();
> +	typeof(wdt.id) id = wdt.id;

typeof here is a bit odd.

> +	int rc = xen_wdt_stop();
> +
> +	wdt.id = id;
> +
> +	return rc;
>  }
>
>  static int xen_wdt_resume(struct platform_device *dev)
>  {
> +	if (!wdt.id)
> +		return 0;

Can't you check is_active instead and avoid having to play tricks in xen_wdt_suspend to preserve a non-0 wdt.id when the watchdog is active?

> +	wdt.id = 0;
>  	return xen_wdt_start();
>  }

>>> On 16.03.12 at 15:58, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2012-03-16 at 14:34 +0000, Jan Beulich wrote:
>> --- a/drivers/watchdog/xen_wdt.c
>> +++ b/drivers/watchdog/xen_wdt.c
>> @@ -297,11 +297,19 @@ static void xen_wdt_shutdown(struct plat
>>
>>  static int xen_wdt_suspend(struct platform_device *dev, pm_message_t state)
>>  {
>> -	return xen_wdt_stop();
>> +	typeof(wdt.id) id = wdt.id;
>
> typeof here is a bit odd.

But I want to match the original field's type.

>> +	int rc = xen_wdt_stop();
>> +
>> +	wdt.id = id;
>> +
>> +	return rc;
>>  }
>>
>>  static int xen_wdt_resume(struct platform_device *dev)
>>  {
>> +	if (!wdt.id)
>> +		return 0;
>
> Can't you check is_active instead and avoid having to play tricks in
> xen_wdt_suspend to preserve a non-0 wdt.id when the watchdog is active?

I first thought of this too, but is_active doesn't represent whether a watchdog is actually engaged. It merely says whether the watchdog device is currently open (but watchdog setup itself may have failed).

Looking at the code for this again, I see another problem though: is_active gets cleared in xen_wdt_release() even when a release was not expected (similar to how iTCO_wdt, pcwd_pci, or pcwd_usb behave, but imo buggy nevertheless), or when xen_wdt_stop() failed.

Jan

>> +	wdt.id = 0;
>>  	return xen_wdt_start();
>>  }

* Jan Beulich <JBeulich@suse.com> [2012-03-16 14:34:10 +0000]:
> Could you give the patch below a try?

I applied it to a vanilla 3.2.11 kernel and booted my test domU with it.
It no longer seems to shut down after migration, so as far as I'm concerned this fixed my problem :)

Thanks!

Wouter.

On Fri, 2012-03-16 at 15:14 +0000, Jan Beulich wrote:
> >> --- a/drivers/watchdog/xen_wdt.c
> >> +++ b/drivers/watchdog/xen_wdt.c
> >> @@ -297,11 +297,19 @@ static void xen_wdt_shutdown(struct plat
> >>
> >>  static int xen_wdt_suspend(struct platform_device *dev, pm_message_t state)
> >>  {
> >> -	return xen_wdt_stop();
> >> +	typeof(wdt.id) id = wdt.id;
> >
> > typeof here is a bit odd.
>
> But I want to match the original field's type.

So why not use that type?

> >> +	int rc = xen_wdt_stop();
> >> +
> >> +	wdt.id = id;
> >> +
> >> +	return rc;
> >>  }
> >>
> >>  static int xen_wdt_resume(struct platform_device *dev)
> >>  {
> >> +	if (!wdt.id)
> >> +		return 0;
> >
> > Can't you check is_active instead and avoid having to play tricks in
> > xen_wdt_suspend to preserve a non-0 wdt.id when the watchdog is active?
>
> I first thought of this too, but is_active doesn't represent whether
> a watchdog is actually engaged - it merely says whether the
> watchdog device is currently open (but watchdog setup itself
> may have failed).

Could you track whether the w/dog is actually engaged in another variable?

Ian.
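
The "another variable" idea might look something like the userspace sketch below. It is an illustration under assumptions, not the driver's code and not a patch posted in this thread: `struct wdt_state`, the `engaged` flag, the `hv_ok` parameter that simulates a failing hypercall, and every function name are hypothetical. It keeps "device is open" (`is_active`) separate from "a timer was armed successfully" (`engaged`), which is the distinction drawn in the quoted reply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified driver state; not the real xen_wdt layout. */
struct wdt_state {
	unsigned int id;   /* hypervisor timer handle, 0 when none is held */
	bool is_active;    /* the device node is open */
	bool engaged;      /* arming succeeded and the user has not stopped it */
};

/* Pretend hypercall: arming fails when hv_ok is false. */
int hv_arm(struct wdt_state *w, bool hv_ok)
{
	if (!hv_ok)
		return -1;
	w->id = 1;
	return 0;
}

void hv_disarm(struct wdt_state *w)
{
	w->id = 0;
}

/* Opening the device marks it active even if arming fails. */
int wdt_open(struct wdt_state *w, bool hv_ok)
{
	int rc;

	w->is_active = true;
	rc = hv_arm(w, hv_ok);
	w->engaged = (rc == 0);
	return rc;
}

/* A user-requested stop (for example magic close) clears the flag. */
void wdt_user_stop(struct wdt_state *w)
{
	hv_disarm(w);
	w->engaged = false;
}

/* Suspend drops the timer but leaves `engaged` as a note for resume. */
void wdt_suspend(struct wdt_state *w)
{
	hv_disarm(w);
}

/* Resume re-arms only if the watchdog was engaged before suspend. */
int wdt_resume(struct wdt_state *w, bool hv_ok)
{
	if (!w->engaged)
		return 0;
	return hv_arm(w, hv_ok);
}

/* An open device whose setup failed must not be armed by resume. */
void wdt_selfcheck(void)
{
	struct wdt_state w = { 0, false, false };

	wdt_open(&w, false);            /* open succeeds, arming fails */
	assert(w.is_active && !w.engaged);
	wdt_suspend(&w);
	wdt_resume(&w, true);
	assert(w.id == 0);              /* resume left it alone */
}
```

With a dedicated flag, suspend no longer needs to preserve a stale `id` value as a marker, at the cost of one more piece of state that every start and stop path has to keep consistent.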