Andrew Cooper
2013-Aug-09 21:17 UTC
[RFC] x86/watchdog: Always disable watchdog before console_force_unlock()
Depending on the state of the conring and serial_tx_buffer,
console_force_unlock() can be a long running operation, usually because of
serial_start_sync().

XenServer testing has found a reliable case where console_force_unlock() on
one PCPU takes long enough for another PCPU to time out due to the watchdog
(such as waiting for a tlb flush callin).

The watchdog timeout causes the second PCPU to repeat the
console_force_unlock(), at which point the first PCPU typically fails an
assertion in spin_unlock_irqrestore(&port->tx_lock) (because the tx_lock has
been unlocked behind itself).

console_force_unlock() is only on emergency paths, so one way or another the
host is going down.  Disable the watchdog before forcing the console lock to
help prevent having pcpus competing with each other to bring the host down.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/cpu/mcheck/mce.c |    1 +
 xen/arch/x86/nmi.c            |    1 +
 xen/arch/x86/traps.c          |    3 +++
 3 files changed, 5 insertions(+)

diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
index 93d7ae1..4c679f3 100644
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -1537,6 +1537,7 @@ static void mc_panic_dump(void)
 void mc_panic(char *s)
 {
     is_mc_panic = 1;
+    watchdog_disable();
     console_force_unlock();

     printk("Fatal machine check: %s\n", s);
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index c93812f..091e520 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -439,6 +439,7 @@ void nmi_watchdog_tick(struct cpu_user_regs * regs)
             this_cpu(alert_counter)++;
             if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
             {
+                watchdog_disable();
                 console_force_unlock();
                 printk("Watchdog timer detects that CPU%d is stuck!\n",
                        smp_processor_id());
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 57dbd0c..b12869e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3163,6 +3163,7 @@ static void pci_serr_error(struct cpu_user_regs *regs)
         raise_softirq(PCI_SERR_SOFTIRQ);
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("\n\nNMI - PCI system error (SERR)\n");
         fatal_trap(TRAP_nmi, regs);
@@ -3178,6 +3179,7 @@ static void io_check_error(struct cpu_user_regs *regs)
     case 'i': /* 'ignore' */
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("\n\nNMI - I/O ERROR\n");
         fatal_trap(TRAP_nmi, regs);
@@ -3197,6 +3199,7 @@ static void unknown_nmi_error(struct cpu_user_regs *regs, unsigned char reason)
     case 'i': /* 'ignore' */
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
         printk("Do you have a strange power saving mode enabled?\n");
-- 
1.7.10.4
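
The interleaving described in the commit message can be replayed in a small single-threaded model. This is only a sketch under stated assumptions: struct toy_lock and every toy_* function are hypothetical stand-ins, not Xen's spinlock implementation, and the checked release returns false at the point where a debug build would presumably hit the assertion.

```c
#include <stdbool.h>

/* Toy lock that only records whether it is held (hypothetical, not Xen's spinlock_t). */
struct toy_lock {
    bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    l->held = true;
}

/* Checked release: returns false where a debug build would fail an assertion. */
static bool toy_lock_release(struct toy_lock *l)
{
    if ( !l->held )
        return false;
    l->held = false;
    return true;
}

/* Forces the lock open regardless of which CPU holds it. */
static void toy_force_unlock(struct toy_lock *l)
{
    l->held = false;
}

/* Replays the sequence from the commit message in program order. */
static bool replay_race(void)
{
    struct toy_lock tx_lock = { false };

    toy_lock_acquire(&tx_lock);        /* CPU A: flushing serial output, holds tx_lock */
    toy_force_unlock(&tx_lock);        /* CPU B: watchdog fires, forces the lock open */
    return toy_lock_release(&tx_lock); /* CPU A: its own unlock finds the lock already free */
}
```

In this model replay_race() reports a failed release, which corresponds to the first PCPU finding tx_lock "unlocked behind itself". Disabling the watchdog before the forced unlock removes the second step from the sequence.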
Jan Beulich
2013-Aug-12 08:50 UTC
Re: [RFC] x86/watchdog: Always disable watchdog before console_force_unlock()
>>> On 09.08.13 at 23:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Depending on the state of the conring and serial_tx_buffer,
> console_force_unlock() can be a long running operation, usually because of
> serial_start_sync().
>
> XenServer testing has found a reliable case where console_force_unlock() on
> one PCPU takes long enough for another PCPU to time out due to the watchdog
> (such as waiting for a tlb flush callin).
>
> The watchdog timeout causes the second PCPU to repeat the
> console_force_unlock(), at which point the first PCPU typically fails an
> assertion in spin_unlock_irqrestore(&port->tx_lock) (because the tx_lock has
> been unlocked behind itself).
>
> console_force_unlock() is only on emergency paths, so one way or another the
> host is going down.  Disable the watchdog before forcing the console lock to
> help prevent having pcpus competing with each other to bring the host down.

So perhaps rather than calling watchdog_disable() before calling
console_force_unlock(), would we not better call the former first
thing from the latter?

Jan
Andrew Cooper
2013-Aug-12 09:35 UTC
Re: [RFC] x86/watchdog: Always disable watchdog before console_force_unlock()
On 12/08/13 09:50, Jan Beulich wrote:
>>>> On 09.08.13 at 23:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Depending on the state of the conring and serial_tx_buffer,
>> console_force_unlock() can be a long running operation, usually because of
>> serial_start_sync().
>>
>> XenServer testing has found a reliable case where console_force_unlock() on
>> one PCPU takes long enough for another PCPU to time out due to the watchdog
>> (such as waiting for a tlb flush callin).
>>
>> The watchdog timeout causes the second PCPU to repeat the
>> console_force_unlock(), at which point the first PCPU typically fails an
>> assertion in spin_unlock_irqrestore(&port->tx_lock) (because the tx_lock has
>> been unlocked behind itself).
>>
>> console_force_unlock() is only on emergency paths, so one way or another the
>> host is going down.  Disable the watchdog before forcing the console lock to
>> help prevent having pcpus competing with each other to bring the host down.
> So perhaps rather than calling watchdog_disable() before calling
> console_force_unlock(), would we not better call the former first
> thing from the latter?
>
> Jan
>

That was indeed my first attempt, but console_force_unlock() is common
while watchdog_* is x86.

I could convert the watchdog to arch specific.  I suppose it is
possible/likely that the Arm folk might want to implement and use
watchdogs?

~Andrew
Jan Beulich
2013-Aug-12 09:43 UTC
Re: [RFC] x86/watchdog: Always disable watchdog before console_force_unlock()
>>> On 12.08.13 at 11:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 12/08/13 09:50, Jan Beulich wrote:
>>>>> On 09.08.13 at 23:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> console_force_unlock() is only on emergency paths, so one way or another the
>>> host is going down.  Disable the watchdog before forcing the console lock to
>>> help prevent having pcpus competing with each other to bring the host down.
>> So perhaps rather than calling watchdog_disable() before calling
>> console_force_unlock(), would we not better call the former first
>> thing from the latter?
>
> That was indeed my first attempt, but console_force_unlock() is common
> while watchdog_* is x86.
>
> I could convert the watchdog to arch specific.  I suppose it is
> possible/likely that the Arm folk might want to implement and use
> watchdogs?

So would I think. Just have ARM have an empty function for the moment.

Jan
Keir Fraser
2013-Aug-12 11:31 UTC
Re: [RFC] x86/watchdog: Always disable watchdog before console_force_unlock()
On 12/08/2013 10:43, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 12.08.13 at 11:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 12/08/13 09:50, Jan Beulich wrote:
>>>>>> On 09.08.13 at 23:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> console_force_unlock() is only on emergency paths, so one way or another
>>>> the host is going down.  Disable the watchdog before forcing the console
>>>> lock to help prevent having pcpus competing with each other to bring the
>>>> host down.
>>> So perhaps rather than calling watchdog_disable() before calling
>>> console_force_unlock(), would we not better call the former first
>>> thing from the latter?
>>
>> That was indeed my first attempt, but console_force_unlock() is common
>> while watchdog_* is x86.
>>
>> I could convert the watchdog to arch specific.  I suppose it is
>> possible/likely that the Arm folk might want to implement and use
>> watchdogs?
>
> So would I think. Just have ARM have an empty function for the
> moment.

Yes, watchdog is not an x86-specific concept, so making some general-purpose
watchdog functions common and stubbed out on some architectures makes total
sense.

 -- Keir

> Jan
>
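
The arrangement the thread converges on (common code calls watchdog_disable() first thing from console_force_unlock(), and architectures without a watchdog provide an empty function) might look roughly like the sketch below. It is a toy model under stated assumptions, not the eventual Xen change: ARCH_HAS_WATCHDOG, watchdog_armed and console_locked are hypothetical names introduced only for illustration, and the function bodies are stand-ins for the real ones.

```c
#include <stdbool.h>

/* Hypothetical build switch standing in for "this architecture has a watchdog". */
#ifndef ARCH_HAS_WATCHDOG
#define ARCH_HAS_WATCHDOG 1
#endif

/* Toy state for the model; neither variable exists in Xen. */
static bool watchdog_armed = true;
static bool console_locked = true;

#if ARCH_HAS_WATCHDOG
/* x86-like case: actually stop the watchdog. */
static void watchdog_disable(void)
{
    watchdog_armed = false;
}
#else
/* ARM-like case for the moment: empty stub so common code can call it unconditionally. */
static void watchdog_disable(void)
{
}
#endif

/* Common code: disable the watchdog first, then force the console lock open. */
static void console_force_unlock(void)
{
    watchdog_disable();
    console_locked = false;
}
```

With this shape, callers such as mc_panic() would not need their own watchdog_disable() call, since every path through console_force_unlock() picks it up.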