Jan Beulich
2007-May-30 15:20 UTC
RE: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
>I've been seeing these pretty regularly on a single-socket dual-core Athlon
>system for the last couple of months, and only on Friday finally found time
>to start looking into these. Besides the messages above, I also see hangs
>in about every other boot attempt but only if I do *not* use serial output
>(which makes debugging a little harder), and never once initial boot finished
>- this is why I finally needed to find time to look into the problem. I shall
>note though that the kernel we use does not disable CONFIG_GENERIC_TIME
>and makes use of a Xen clocksource as posted by Jeremy among the
>paravirt ops patches.
>What happens when the hang occurs (in do_nanosleep context) is that the
>time read/interpolated from the Xen provided values is in the past compared
>to the last value read (and cached inside the kernel), resulting in a huge
>timeout value rather than the intended 50ms one.
>Without having collected data proving this (will do later today), I currently
>think that the interpolation parameters are too imprecise until the first time
>local_time_calibration() runs on each CPU (i.e. during little less than the first
>second of dom0's life).

The box I'm looking at takes 600ms to enable ACPI mode, during which time no interrupts get delivered. Since it has no (visible) HPET, it has to use the PIT, whose 16-bit counter manages to roll over 11 times during this process. The result is that the TSC is considered to be running too fast and hence gets slowed down. Since this slow-down doesn't happen at exactly the same time on both CPUs (it can't be expected to), one CPU starts reporting measurably smaller nanosecond time values than the other, hence monotonicity gets violated pretty significantly.
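As a rough check of the roll-over figure, here is a minimal sketch of the arithmetic. It assumes the standard PC PIT input clock of 1,193,182 Hz, which is not stated in the message, and the function names are purely illustrative. It shows how often a 16-bit counter wraps and how much time goes unaccounted for if whole wraps are missed during a 600ms blackout.

```python
# Assumption: the standard PC PIT input clock; this value is not from the thread.
PIT_HZ = 1_193_182
PIT_COUNTER_BITS = 16


def pit_wrap_period_s(hz=PIT_HZ, bits=PIT_COUNTER_BITS):
    """Seconds it takes the free-running counter to wrap once."""
    return (1 << bits) / hz


def wraps_during(blackout_s, hz=PIT_HZ, bits=PIT_COUNTER_BITS):
    """Fractional number of counter wraps that fit into the blackout."""
    return blackout_s / pit_wrap_period_s(hz, bits)


def time_lost_s(blackout_s, hz=PIT_HZ, bits=PIT_COUNTER_BITS):
    """Time that vanishes if only the counter value is seen afterwards,
    i.e. every whole wrap inside the blackout goes uncounted."""
    period = pit_wrap_period_s(hz, bits)
    return int(blackout_s // period) * period


if __name__ == "__main__":
    print(f"wrap period : {pit_wrap_period_s() * 1e3:.3f} ms")
    print(f"wraps/600ms : {wraps_during(0.6):.2f}")
    print(f"time lost   : {time_lost_s(0.6) * 1e3:.1f} ms")
```

Under these assumptions the counter wraps roughly every 55ms, so a 600ms window spans between 10 and 11 wraps depending on where the counter started. If the platform clock loses that much time while the TSC keeps counting, the TSC looks too fast by comparison, which matches the slow-down described above.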
I'm therefore considering:
- making the PIT timer recover from being disabled for periods longer than what the 16-bit counter can tolerate (by means of estimating the number of roll-overs based on the TSC) - this would probably work well shortly after boot or at any time all TSCs are sufficiently synchronized, but could go pretty wrong as the individual TSCs drift apart
- inventing a method in the kernel that can cover even significantly non-monotonic values interpolated on different CPUs (it is clear from the data collected that small deviations from monotonic values must be accounted for in any case, but that could be done by simply returning the most recently returned value in case it turns out that the interpolated value is smaller than that, so the issue is really how to reasonably bridge large gaps)

Suggestions/opinions?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
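The two ideas in this message can be sketched roughly as follows. This is an illustration in Python, not Xen or kernel code. The names estimate_missed_wraps and MonotonicClock are hypothetical, and the PIT clock rate of 1,193,182 Hz is an assumption not taken from the thread. The first function guesses how many whole 16-bit wraps went unseen by comparing against TSC-derived elapsed time. The class applies the "return the most recently returned value" rule for small backward steps.

```python
import threading

PIT_HZ = 1_193_182          # assumption: standard PC PIT input clock
PIT_WRAP_TICKS = 1 << 16    # ticks per wrap of the 16-bit counter


def estimate_missed_wraps(tsc_delta, tsc_hz, pit_delta_ticks):
    """Estimate whole PIT wraps missed between two reads.

    tsc_delta       -- TSC cycles elapsed between the reads
    tsc_hz          -- assumed TSC frequency (only as good as its calibration)
    pit_delta_ticks -- PIT ticks elapsed modulo 2**16, as read from the counter
    """
    expected_ticks = tsc_delta * PIT_HZ / tsc_hz
    return max(0, round((expected_ticks - pit_delta_ticks) / PIT_WRAP_TICKS))


class MonotonicClock:
    """Wraps a raw time source and never returns a value below the last one."""

    def __init__(self, read_ns):
        self._read_ns = read_ns      # callable returning interpolated time in ns
        self._last_ns = 0
        self._lock = threading.Lock()

    def now_ns(self):
        with self._lock:
            t = self._read_ns()
            if t < self._last_ns:
                # Raw value went backwards: repeat the last value handed out.
                t = self._last_ns
            else:
                self._last_ns = t
            return t


if __name__ == "__main__":
    # 0.6 s at an assumed 2 GHz TSC; the PIT only shows the remainder ticks.
    print(estimate_missed_wraps(1_200_000_000, 2_000_000_000, 60_549))
    samples = iter([100, 90, 120])
    clock = MonotonicClock(lambda: next(samples))
    print([clock.now_ns() for _ in range(3)])
```

The clamp only hides small deviations. With a large backward gap the clock would appear frozen until the raw source catches up, which is the open problem raised in the second bullet. The wrap estimate likewise depends on the TSC frequency being trustworthy, which is the drift concern in the first bullet.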
Keir Fraser
2007-May-30 15:39 UTC
Re: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
On 30/5/07 16:20, "Jan Beulich" <jbeulich@novell.com> wrote:

> The box I'm looking at takes 600ms to enable ACPI mode

!!! What does this entail? Is it some blob of AML? :-)

 -- Keir
Jan Beulich
2007-May-30 16:02 UTC
Re: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
>>> Keir Fraser <keir@xensource.com> 30.05.07 17:39 >>>
>On 30/5/07 16:20, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> The box I'm looking at takes 600ms to enable ACPI mode
>
>!!! What does this entail? Is it some blob of AML? :-)

No, this is simply the port write (in acpi_hw_set_mode(), case ACPI_SYS_MODE_ACPI), obviously followed by lengthy SMM execution.

Jan
Keir Fraser
2007-May-30 16:25 UTC
Re: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
On 30/5/07 17:02, "Jan Beulich" <jbeulich@novell.com> wrote:

> No, this is simply the port write (in acpi_hw_set_mode(), case
> ACPI_SYS_MODE_ACPI),
> obviously followed by lengthy SMM execution.

Ah, okay. How about if we add support for ACPI PM timer as platform clock source? Should be easy, any ACPI system will most likely provide it, and it should be much better to use than the PIT (even a 24-bit PM timer should only wrap every 5 seconds). We'll have to prevent dom0 monkeying with the timer: either disallow accesses, or emulate the register.

 -- Keir
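For a sense of scale, a minimal sketch of the PM timer wrap arithmetic follows. It assumes the ACPI PM timer's nominal rate of 3,579,545 Hz, which is not quoted in the message, and the helper names are illustrative only. It also shows the usual masked subtraction for taking a delta across a single wrap.

```python
# Assumption: nominal ACPI PM timer frequency; this value is not from the thread.
PM_TIMER_HZ = 3_579_545


def wrap_period_s(bits, hz=PM_TIMER_HZ):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / hz


def counter_delta(prev, cur, bits):
    """Ticks elapsed from prev to cur, valid across at most one wrap."""
    return (cur - prev) & ((1 << bits) - 1)


if __name__ == "__main__":
    print(f"24-bit wrap: {wrap_period_s(24):.2f} s")
    print(f"32-bit wrap: {wrap_period_s(32):.0f} s")
    # A read just before the wrap followed by one just after it.
    print(counter_delta(0xFFFFF0, 0x000010, 24))
```

With these numbers a 24-bit counter wraps a little under every 5 seconds, so a 600ms blackout fits comfortably inside one period, whereas the 16-bit PIT counter wraps many times in the same window.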
Keir Fraser
2007-May-30 17:31 UTC
Re: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
On 30/5/07 17:25, "Keir Fraser" <keir@xensource.com> wrote:

> How about if we add support for ACPI PM timer as platform clock source?
> Should be easy, any ACPI system will most likely provide it, and it should
> be much better to use than the PIT (even a 24-bit PM timer should only wrap
> every 5 seconds). We'll have to prevent dom0 monkeying with the timer:
> either disallow accesses, or emulate the register.

I've now done this in c/s 15189:2d7d33ac982. Works okay on my test box and is now available from the staging tree.

I didn't implement anything to stop dom0 clobbering the timer register, as we don't currently do so for cyclone or hpet time sources either. We can add it if it turns out to be a problem.

 -- Keir