Andrew Cooper
2013-May-23 20:32 UTC
[PATCH] x86/watchdog: Use real timestamps for watchdog timeout
Do not assume that we will only receive interrupts at a rate of nmi_hz. On a test system being debugged, I observed a PCI SERR being continuously asserted without the SERR bit being set. The result was Xen "exceeding" a 300 second timeout within 1 second. Change the nmi_watchdog_tick() timecounting to use timestamps rather than a rate calculated from nmi_hz (which itself has been seen to deviate on some systems due to Turbo/Pstates). Also, move the comment to a more appropriate place (as we would expect to enter the old if() block once a second anyway), and fix up two trailing whitespace issues. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> diff -r c6fb586f83a0 -r ebb0070be9fd xen/arch/x86/nmi.c --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -396,7 +396,7 @@ static struct notifier_block cpu_nmi_nfb }; static DEFINE_PER_CPU(unsigned int, last_irq_sums); -static DEFINE_PER_CPU(unsigned int, alert_counter); +static DEFINE_PER_CPU(s_time_t, last_irq_change); static atomic_t watchdog_disable_count = ATOMIC_INIT(1); @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r if ( (this_cpu(last_irq_sums) == sum) && !atomic_read(&watchdog_disable_count) ) { - /* - * Ayiee, looks like this CPU is stuck ... wait for the timeout - * before doing the oops ... - */ - this_cpu(alert_counter)++; - if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz ) + s_time_t last_change = this_cpu(last_irq_change); + + if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) ) { + /* Ayiee, looks like this CPU is stuck. */ + console_force_unlock(); printk("Watchdog timer detects that CPU%d is stuck!\n", smp_processor_id()); fatal_trap(TRAP_nmi, regs); } - } - else + } + else { this_cpu(last_irq_sums) = sum; - this_cpu(alert_counter) = 0; + this_cpu(last_irq_change) = NOW(); } if ( nmi_perfctr_msr )
Jan Beulich
2013-May-24 07:09 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
>>> On 23.05.13 at 22:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > Do not assume that we will only receive interrupts at a rate of nmi_hz. On a > test system being debugged, I observed a PCI SERR being continuously > asserted > without the SERR bit being set. The result was Xen "exceeding" a 300 second > timeout within 1 second.This is a questionable rationale for the patch, no matter that conceptually I don''t mind a change like this. On broken systems like this it may be more reasonable to require the watchdog (which is disabled by default anyway) to not get enabled.> @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r > if ( (this_cpu(last_irq_sums) == sum) && > !atomic_read(&watchdog_disable_count) ) > { > - /* > - * Ayiee, looks like this CPU is stuck ... wait for the timeout > - * before doing the oops ... > - */ > - this_cpu(alert_counter)++; > - if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz ) > + s_time_t last_change = this_cpu(last_irq_change); > + > + if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) )You can''t use NOW() here - while the time updating code is safe against normal interrupts, it''s not atomic wrt NMIs. Since you don''t really need this calculation to be very precise, doing it using plain TSC values might be acceptable, with the one caveat that you''d need to check that doing this in the middle of a TSC update (by time_calibration_tsc_rendezvous()) is still safe and sufficiently precise. The only other (non-generic) alternative I see is to use the HPET main counter if 64-bit capable _and_ we ran it as 64-bit counter. But that would neither cover all machines nor be backportable (because of 32-bit''s constraints). Jan> { > + /* Ayiee, looks like this CPU is stuck. */ > + > console_force_unlock(); > printk("Watchdog timer detects that CPU%d is stuck!\n", > smp_processor_id()); > fatal_trap(TRAP_nmi, regs); > } > - } > - else > + } > + else > { > this_cpu(last_irq_sums) = sum; > - this_cpu(alert_counter) = 0; > + this_cpu(last_irq_change) = NOW(); > } > > if ( nmi_perfctr_msr )
Tim Deegan
2013-May-24 09:37 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote:> Do not assume that we will only receive interrupts at a rate of nmi_hz. On a > test system being debugged, I observed a PCI SERR being continuously asserted > without the SERR bit being set. The result was Xen "exceeding" a 300 second > timeout within 1 second.Sounds like the CPU is indeed stuck, and the watchdog has just optimized away the 5 minutes of back-to-back NMIs. :) Handling this case it nice, but I wonder whether this patch ought to detect and report ludicrous NMI rates rather than silently ignoring them. I guess that''s hard to do in an NMI handler, other than by adjusting the printk when we crash. Tim.
Andrew Cooper
2013-May-24 09:57 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 08:09, Jan Beulich wrote:>>>> On 23.05.13 at 22:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> Do not assume that we will only receive interrupts at a rate of nmi_hz. On a >> test system being debugged, I observed a PCI SERR being continuously >> asserted >> without the SERR bit being set. The result was Xen "exceeding" a 300 second >> timeout within 1 second. > This is a questionable rationale for the patch, no matter that > conceptually I don''t mind a change like this. On broken systems > like this it may be more reasonable to require the watchdog (which > is disabled by default anyway) to not get enabled.In this particular system there is already a need for a BIOS fix (for another issue), so adding a fix of the SERR bit to list is also doable. I suppose that I should say that the reason for the patch is not because *this* system entered that bad state, but because it is possible with a sufficiently high rate of NMIs in general.> >> @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r >> if ( (this_cpu(last_irq_sums) == sum) && >> !atomic_read(&watchdog_disable_count) ) >> { >> - /* >> - * Ayiee, looks like this CPU is stuck ... wait for the timeout >> - * before doing the oops ... >> - */ >> - this_cpu(alert_counter)++; >> - if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz ) >> + s_time_t last_change = this_cpu(last_irq_change); >> + >> + if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) ) > You can''t use NOW() here - while the time updating code is safe > against normal interrupts, it''s not atomic wrt NMIs.But NMIs are latched at the hardware level. If we get a nested NMI the Xen will be toast on the exit path anyway. We dont currently detect that yet because I have not had sufficient time to complete my reentrant issues patch series yet.> > Since you don''t really need this calculation to be very precise, > doing it using plain TSC values might be acceptable, with the one > caveat that you''d need to check that doing this in the middle of a > TSC update (by time_calibration_tsc_rendezvous()) is still safe > and sufficiently precise.I dont see how using raw TSC values differs from using NOW() when it comes to racing with time_calibration_tsc_rendezvous(). NOW() has an rdtsc in it. From a quick eyeball of the code, I cant spot any errors which would occur. time_calibration_tsc_rendezvous() already has the possibility of an NMI interrupting it which might skew the calculation. On the other hand, NOW() is read-only with respect to any system state The only issue might be the write_tsc() causing the tsc to step backwards, but that is already a potential problem using NOW() on either side of a tsc_rendezvous() anyway. ~Andrew> > The only other (non-generic) alternative I see is to use the HPET > main counter if 64-bit capable _and_ we ran it as 64-bit counter. > But that would neither cover all machines nor be backportable > (because of 32-bit''s constraints). > > Jan > >> { >> + /* Ayiee, looks like this CPU is stuck. */ >> + >> console_force_unlock(); >> printk("Watchdog timer detects that CPU%d is stuck!\n", >> smp_processor_id()); >> fatal_trap(TRAP_nmi, regs); >> } >> - } >> - else >> + } >> + else >> { >> this_cpu(last_irq_sums) = sum; >> - this_cpu(alert_counter) = 0; >> + this_cpu(last_irq_change) = NOW(); >> } >> >> if ( nmi_perfctr_msr ) > >
Andrew Cooper
2013-May-24 10:03 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 10:37, Tim Deegan wrote:> At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote: >> Do not assume that we will only receive interrupts at a rate of nmi_hz. On a >> test system being debugged, I observed a PCI SERR being continuously asserted >> without the SERR bit being set. The result was Xen "exceeding" a 300 second >> timeout within 1 second. > Sounds like the CPU is indeed stuck, and the watchdog has just optimized > away the 5 minutes of back-to-back NMIs. :) > > Handling this case it nice, but I wonder whether this patch ought to > detect and report ludicrous NMI rates rather than silently ignoring > them. I guess that''s hard to do in an NMI handler, other than by > adjusting the printk when we crash. > > Tim.Actually I suspect the system was livelocked with PCI SERRs being issued from a PCIe switch. I only have second granularity on the serial console, but can confirm that cpu0 was perfectly alive and well within the same second as the watchdog supposedly expiring. I was considering trying to work around a ludicrous rate of interrupts, but decided to go for the easier patch first ~Andrew
Tim Deegan
2013-May-24 10:13 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:> On 24/05/13 08:09, Jan Beulich wrote: > > You can''t use NOW() here - while the time updating code is safe > > against normal interrupts, it''s not atomic wrt NMIs. > > But NMIs are latched at the hardware level. If we get a nested NMI the > Xen will be toast on the exit path anyway.The problem is that an NMI can arrive while local_time_calibration() is writing its results, so calling NOW() in the NMI handler might return garbage.> > Handling this case it nice, but I wonder whether this patch ought to > > detect and report ludicrous NMI rates rather than silently ignoring > > them. I guess that''s hard to do in an NMI handler, other than by > > adjusting the printk when we crash.Actually on second thoughts it''s easier: as well as having this patch (or near equivalent) to avoid premature watchdog expiry, we cna detect the NMI rate in, say, the timer softirq and report if it''s gone mad. Cheers, Tim.
Andrew Cooper
2013-May-24 10:33 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 11:13, Tim Deegan wrote:> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote: >> On 24/05/13 08:09, Jan Beulich wrote: >>> You can''t use NOW() here - while the time updating code is safe >>> against normal interrupts, it''s not atomic wrt NMIs. >> But NMIs are latched at the hardware level. If we get a nested NMI the >> Xen will be toast on the exit path anyway. > The problem is that an NMI can arrive while local_time_calibration() is > writing its results, so calling NOW() in the NMI handler might return > garbage.Aah - I see. Sorry - I misunderstood the original point. Yes - that is an issue. Two solutions come to mind. 1) Along with the local_irq_disable()/enable() pairs in local_time_calibration, having an atomic_t indicating "time data update in progress", allowing the NMI handler to decide to bail early. 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and a different atomic_t to indicate which one is consistent. This would allow the NMI handler to always use one consistent set of timing information.> >>> Handling this case it nice, but I wonder whether this patch ought to >>> detect and report ludicrous NMI rates rather than silently ignoring >>> them. I guess that''s hard to do in an NMI handler, other than by >>> adjusting the printk when we crash. > Actually on second thoughts it''s easier: as well as having this patch > (or near equivalent) to avoid premature watchdog expiry, we cna detect > the NMI rate in, say, the timer softirq and report if it''s gone mad. > > Cheers, > > Tim.I was thinking along that line, but had not yet worked out where to put it. That looks like the best place. ~Andrew
Jan Beulich
2013-May-24 11:36 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
>>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote: >> > Handling this case it nice, but I wonder whether this patch ought to >> > detect and report ludicrous NMI rates rather than silently ignoring >> > them. I guess that''s hard to do in an NMI handler, other than by >> > adjusting the printk when we crash. > > Actually on second thoughts it''s easier: as well as having this patch > (or near equivalent) to avoid premature watchdog expiry, we cna detect > the NMI rate in, say, the timer softirq and report if it''s gone mad.I may not get how you imagine this to work, but why would you assume the timer softirq gets any chance to run anymore with an absurd NMI rate (or at least get run enough times to prevent the watchdog to fire)? Jan
Jan Beulich
2013-May-24 11:42 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 24/05/13 11:13, Tim Deegan wrote: >> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote: >>> On 24/05/13 08:09, Jan Beulich wrote: >>>> You can''t use NOW() here - while the time updating code is safe >>>> against normal interrupts, it''s not atomic wrt NMIs. >>> But NMIs are latched at the hardware level. If we get a nested NMI the >>> Xen will be toast on the exit path anyway. >> The problem is that an NMI can arrive while local_time_calibration() is >> writing its results, so calling NOW() in the NMI handler might return >> garbage. > > Aah - I see. Sorry - I misunderstood the original point. > > Yes - that is an issue. > > Two solutions come to mind. > > 1) Along with the local_irq_disable()/enable() pairs in > local_time_calibration, having an atomic_t indicating "time data update > in progress", allowing the NMI handler to decide to bail early. > > 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and > a different atomic_t to indicate which one is consistent. This would > allow the NMI handler to always use one consistent set of timing > information.What''s the advantage of either over using the plain TSC values? Accuracy of the expiry, but the accuracy here doesn''t matter much (as long as its not off by an order of a magnitude), and would be achieved by some presumably not very neat code of no other (general) use. And if picking up one of these, then 2 seems the preferable one to me. Jan
Andrew Cooper
2013-May-24 12:00 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 12:42, Jan Beulich wrote:>>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 24/05/13 11:13, Tim Deegan wrote: >>> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote: >>>> On 24/05/13 08:09, Jan Beulich wrote: >>>>> You can''t use NOW() here - while the time updating code is safe >>>>> against normal interrupts, it''s not atomic wrt NMIs. >>>> But NMIs are latched at the hardware level. If we get a nested NMI the >>>> Xen will be toast on the exit path anyway. >>> The problem is that an NMI can arrive while local_time_calibration() is >>> writing its results, so calling NOW() in the NMI handler might return >>> garbage. >> Aah - I see. Sorry - I misunderstood the original point. >> >> Yes - that is an issue. >> >> Two solutions come to mind. >> >> 1) Along with the local_irq_disable()/enable() pairs in >> local_time_calibration, having an atomic_t indicating "time data update >> in progress", allowing the NMI handler to decide to bail early. >> >> 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and >> a different atomic_t to indicate which one is consistent. This would >> allow the NMI handler to always use one consistent set of timing >> information. > What''s the advantage of either over using the plain TSC values? > Accuracy of the expiry, but the accuracy here doesn''t matter > much (as long as its not off by an order of a magnitude), and > would be achieved by some presumably not very neat code of no > other (general) use. > > And if picking up one of these, then 2 seems the preferable one to > me. > > Jan >The TSC scale is recalculated as part of local_time_calibration(). I couldn''t spot any other sensible way of converting a TSC delta to something approximating seconds. Have I missed something? ~Andrew
Tim Deegan
2013-May-24 12:41 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote:> >>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote: > >> > Handling this case it nice, but I wonder whether this patch ought to > >> > detect and report ludicrous NMI rates rather than silently ignoring > >> > them. I guess that''s hard to do in an NMI handler, other than by > >> > adjusting the printk when we crash. > > > > Actually on second thoughts it''s easier: as well as having this patch > > (or near equivalent) to avoid premature watchdog expiry, we cna detect > > the NMI rate in, say, the timer softirq and report if it''s gone mad. > > I may not get how you imagine this to work, but why would you > assume the timer softirq gets any chance to run anymore with an > absurd NMI rate (or at least get run enough times to prevent the > watchdog to fire)?In that case, the watchdog will fire, if we can add something to the watchdog printk if the NMI count is mugh higher than (timeout * nmi_hz). I was worrying about the case where there are lots of NMIs but not so many that the system livelocks entirely. It would be nice to have some non-fatal warning that something is wrong. A one-time (or at least very limited) printk in the softirq would do that for not much extra cost. At 12:42 +0100 on 24 May (1369399344), Jan Beulich wrote:> >>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > On 24/05/13 11:13, Tim Deegan wrote: > >> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote: > >>> On 24/05/13 08:09, Jan Beulich wrote: > >>>> You can''t use NOW() here - while the time updating code is safe > >>>> against normal interrupts, it''s not atomic wrt NMIs. > >>> But NMIs are latched at the hardware level. If we get a nested NMI the > >>> Xen will be toast on the exit path anyway. > >> The problem is that an NMI can arrive while local_time_calibration() is > >> writing its results, so calling NOW() in the NMI handler might return > >> garbage. > > > > Aah - I see. Sorry - I misunderstood the original point. > > > > Yes - that is an issue. > > > > Two solutions come to mind. > > > > 1) Along with the local_irq_disable()/enable() pairs in > > local_time_calibration, having an atomic_t indicating "time data update > > in progress", allowing the NMI handler to decide to bail early. > > > > 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and > > a different atomic_t to indicate which one is consistent. This would > > allow the NMI handler to always use one consistent set of timing > > information.Of those two, I prefer (1), just because it doesn''t add any cost to the normal users of NOW(). Using TSC to gate the actual watchdog crash might get a bit messy, especially if it ends up adding code to the users of write_tsc(). But all we need is to double-check the existing count -- I think it would be enough to say "the watchdog fired but it looks like it was caused by an NMI storm" and crash anyway. (At least assuming timeout * nmi_hz is a a largish number.) And nmi_watchdog_tick() can just check regs->eip as a hint not to trust the scale factors. :) Tim.
Andrew Cooper
2013-May-24 12:48 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 13:41, Tim Deegan wrote:> At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote: >>>>> On 24.05.13 at 12:13, Tim Deegan <tim@xen.org> wrote: >>>>> Handling this case it nice, but I wonder whether this patch ought to >>>>> detect and report ludicrous NMI rates rather than silently ignoring >>>>> them. I guess that''s hard to do in an NMI handler, other than by >>>>> adjusting the printk when we crash. >>> Actually on second thoughts it''s easier: as well as having this patch >>> (or near equivalent) to avoid premature watchdog expiry, we cna detect >>> the NMI rate in, say, the timer softirq and report if it''s gone mad. >> I may not get how you imagine this to work, but why would you >> assume the timer softirq gets any chance to run anymore with an >> absurd NMI rate (or at least get run enough times to prevent the >> watchdog to fire)? > In that case, the watchdog will fire, if we can add something to the > watchdog printk if the NMI count is mugh higher than (timeout * nmi_hz). > I was worrying about the case where there are lots of NMIs but not so > many that the system livelocks entirely. It would be nice to have some > non-fatal warning that something is wrong. A one-time (or at least very > limited) printk in the softirq would do that for not much extra cost. > > At 12:42 +0100 on 24 May (1369399344), Jan Beulich wrote: >>>>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> On 24/05/13 11:13, Tim Deegan wrote: >>>> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote: >>>>> On 24/05/13 08:09, Jan Beulich wrote: >>>>>> You can''t use NOW() here - while the time updating code is safe >>>>>> against normal interrupts, it''s not atomic wrt NMIs. >>>>> But NMIs are latched at the hardware level. If we get a nested NMI the >>>>> Xen will be toast on the exit path anyway. >>>> The problem is that an NMI can arrive while local_time_calibration() is >>>> writing its results, so calling NOW() in the NMI handler might return >>>> garbage. >>> Aah - I see. Sorry - I misunderstood the original point. >>> >>> Yes - that is an issue. >>> >>> Two solutions come to mind. >>> >>> 1) Along with the local_irq_disable()/enable() pairs in >>> local_time_calibration, having an atomic_t indicating "time data update >>> in progress", allowing the NMI handler to decide to bail early. >>> >>> 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and >>> a different atomic_t to indicate which one is consistent. This would >>> allow the NMI handler to always use one consistent set of timing >>> information. > Of those two, I prefer (1), just because it doesn''t add any cost to the > normal users of NOW(). > > Using TSC to gate the actual watchdog crash might get a bit messy, > especially if it ends up adding code to the users of write_tsc(). But > all we need is to double-check the existing count -- I think it would be > enough to say "the watchdog fired but it looks like it was caused by an > NMI storm" and crash anyway. (At least assuming timeout * nmi_hz is a a > largish number.) And nmi_watchdog_tick() can just check regs->eip as a > hint not to trust the scale factors. :) > > Tim.I was not planning to make any requirement to change users of NOW(). The only consumers which need to care about the shadow set or "update in progress" flag would be the NMI and MCE handlers, being the only ones which can interrupt the current structure update. The shadow set can be neatly hidden away from anyone who doesn''t wish to know. ~Andrew
Jan Beulich
2013-May-24 13:11 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
>>> On 24.05.13 at 14:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > The TSC scale is recalculated as part of local_time_calibration(). I > couldn''t spot any other sensible way of converting a TSC delta to > something approximating seconds. Have I missed something?Since we''re not after a precise amount of time, using cpu_khz as basis and accepting that the period until the watchdog may fire could be, say, twice the nominal amount doesn''t seem that bad to me. Or, if we''re going the double checking route, retaining the current counting and simply using way too low a TSC delta to alter the printed message, largely like what Tim suggested, could be an option. In that case, even a factor 10 mis-scaling of the TSC count wouldn''t hurt. The only real problem would be when the TSC stops counting while in deep C states, but that again is a non-issue here because you can''t at the same time get NMI storms and be in deep C states permanently. Jan
Jan Beulich
2013-May-24 13:17 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
>>> On 24.05.13 at 14:41, Tim Deegan <tim@xen.org> wrote: > At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote: >> > 1) Along with the local_irq_disable()/enable() pairs in >> > local_time_calibration, having an atomic_t indicating "time data update >> > in progress", allowing the NMI handler to decide to bail early. >> > >> > 2) Modify local_time_calibration() to fill in a shadow cpu_time set, and >> > a different atomic_t to indicate which one is consistent. This would >> > allow the NMI handler to always use one consistent set of timing >> > information. > > Of those two, I prefer (1), just because it doesn''t add any cost to the > normal users of NOW().The reason I dislike 1 is because you then have however small a probability of many/all NMI instance just happening while the time gets updated, resulting in all of them bailing early, and the watchdog never firing.> Using TSC to gate the actual watchdog crash might get a bit messy, > especially if it ends up adding code to the users of write_tsc().The only problematic write_tsc() user is that to recover from a stopped counter in deep C states. This is not a meaningful problem because - as just said in another reply - NMI storms and long times spent in deep C states exclude each other.> And nmi_watchdog_tick() can just check regs->eip as a > hint not to trust the scale factors. :)By doing a range check looking for it to point into some function? How would that cope with LTO? Jan
Tim Deegan
2013-May-24 13:55 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote:> On 24/05/13 13:41, Tim Deegan wrote: > > Of those two, I prefer (1), just because it doesn''t add any cost to the > > normal users of NOW(). > > I was not planning to make any requirement to change users of NOW().Well, you were planning to make NOW() slightly more expensive by needing to look up which of the banekd alternatives is valid. In any case, I think some sort of approximate version based on tsc will do. Tim.
Tim Deegan
2013-May-24 14:01 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 14:17 +0100 on 24 May (1369405052), Jan Beulich wrote:> > Using TSC to gate the actual watchdog crash might get a bit messy, > > especially if it ends up adding code to the users of write_tsc(). > > The only problematic write_tsc() user is that to recover from a > stopped counter in deep C states. This is not a meaningful problem > because - as just said in another reply - NMI storms and long times > spent in deep C states exclude each other.Yep. But we could go into a C-state and be woken by an NMI storm, meaning that our ''before'' TSC value would be immediately wrong. The more I think about this the less I think we should be trying to measure elapsed time except to change the printout.> > And nmi_watchdog_tick() can just check regs->eip as a > > hint not to trust the scale factors. :) > > By doing a range check looking for it to point into some function? > How would that cope with LTO?Unreliably. :) I imagine in most cases it would be fine, since that function has its address taken, but unless we called out to an explicit .S routine to do the update it ouwld be at best a hint. Tim.
Andrew Cooper
2013-May-24 14:29 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 14:55, Tim Deegan wrote:> At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote: >> On 24/05/13 13:41, Tim Deegan wrote: >>> Of those two, I prefer (1), just because it doesn''t add any cost to the >>> normal users of NOW(). >> I was not planning to make any requirement to change users of NOW(). > Well, you were planning to make NOW() slightly more expensive by needing > to look up which of the banekd alternatives is valid. In any case, I > think some sort of approximate version based on tsc will do. > > Tim.I was planning to memcpy the shadow set over the main set as part of calibration, leaving no alteration whatsoever to NOW(). An approximation from the TSC alone would be better so long as it is a reasonable approximation. I am concerned about how accuate a dumb approximation would be for non-stable TSCs etc. ~Andrew
Tim Deegan
2013-May-24 17:10 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 15:29 +0100 on 24 May (1369409374), Andrew Cooper wrote:> On 24/05/13 14:55, Tim Deegan wrote: > > At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote: > >> On 24/05/13 13:41, Tim Deegan wrote: > >>> Of those two, I prefer (1), just because it doesn''t add any cost to the > >>> normal users of NOW(). > >> I was not planning to make any requirement to change users of NOW(). > > Well, you were planning to make NOW() slightly more expensive by needing > > to look up which of the banekd alternatives is valid. In any case, I > > think some sort of approximate version based on tsc will do. > > > > Tim. > > I was planning to memcpy the shadow set over the main set as part of > calibration, leaving no alteration whatsoever to NOW().Sorry, yes, I see how that works now. And so I too prefer (2). :)> An approximation from the TSC alone would be better so long as it is a > reasonable approximation. I am concerned about how accuate a dumb > approximation would be for non-stable TSCs etc.Yep. I''m more and more convinced that we should gate on the number of NMIs we''ve taken without seeing a timer tick. I''m more afratid of funny TSC edge cases (and remember we might take an NMI anywhere in the s3 wakeup) than I am of machines with really bad NMI storms. So even if the approximate time is wildly off we just print the wrong thing. In the case where you saw this (and cpu0 was alive for a while before it managed a burst of enough NMIs), would detecting and warning about high NMI rates be enough to point out what''s gone wrong? Tim.
Andrew Cooper
2013-May-24 17:27 UTC
Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
On 24/05/13 18:10, Tim Deegan wrote:> At 15:29 +0100 on 24 May (1369409374), Andrew Cooper wrote: >> On 24/05/13 14:55, Tim Deegan wrote: >>> At 13:48 +0100 on 24 May (1369403327), Andrew Cooper wrote: >>>> On 24/05/13 13:41, Tim Deegan wrote: >>>>> Of those two, I prefer (1), just because it doesn''t add any cost to the >>>>> normal users of NOW(). >>>> I was not planning to make any requirement to change users of NOW(). >>> Well, you were planning to make NOW() slightly more expensive by needing >>> to look up which of the banekd alternatives is valid. In any case, I >>> think some sort of approximate version based on tsc will do. >>> >>> Tim. >> I was planning to memcpy the shadow set over the main set as part of >> calibration, leaving no alteration whatsoever to NOW(). > Sorry, yes, I see how that works now. And so I too prefer (2). :) > >> An approximation from the TSC alone would be better so long as it is a >> reasonable approximation. I am concerned about how accuate a dumb >> approximation would be for non-stable TSCs etc. > Yep. I''m more and more convinced that we should gate on the number of > NMIs we''ve taken without seeing a timer tick. I''m more afratid of funny > TSC edge cases (and remember we might take an NMI anywhere in the s3 > wakeup) than I am of machines with really bad NMI storms. So even if > the approximate time is wildly off we just print the wrong thing. > > In the case where you saw this (and cpu0 was alive for a while before > it managed a burst of enough NMIs), would detecting and warning > about high NMI rates be enough to point out what''s gone wrong? > > Tim.Not directly, butb or my debugging case knowing when the NMI storm has started is very useful so I can dump lspci -vvvxxxx to get the debug state from whichever PCI device is the original cause of the SERR storm. I certainly don''t think there is anything useful Xen could automatically do when discovering an NMI storm, but leaving a message on the serial or in a crash state certainly helps someone trying to investigate why the server reset. It is certainly better than finding an NMI watchdog timeout with serial timestamps proving that the watchdog didn''t actually time out :), and starting debugging wondering WTF was going on. ~Andrew