Tom W
2013-Apr-22 22:50 UTC
Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
Hello Xen Developers!

After fully researching this ourselves and talking to many Xen consultants, we have been advised to inquire here about a rare Xen bug we are possibly experiencing. Any help or advice would be much appreciated, thanks in advance! We're also open to offering some financial support to solve this problem.

*Here is a summary of the problem:*
-very infrequently the domU clock instantly jumps ahead a massive amount of time and then appears to lock on the new time (i.e. time stops)
-this has only happened 3 times since Jan. 2013 for us, on two different physical rack mounted machines that are still running today with very similar parts and configuration
-the clock jumped ahead to the year 2264 in the first two occurrences and only 3 days ahead in the third

*Here are more specific details:*
-when the incident occurred there was no heavy load, no high temperatures, no hardware/memory/EDAC errors, no swapping, no errors reported anywhere
-the dom0 and other concurrently running domUs had no clock issues, the hardware BIOS clock remained OK as well
-the clock did not slowly skew/drift ahead, nor have we ever had any skewing/drifting clock problems; it appears to have simply jumped to the new date and stopped
-the hardware has only had CentOS dom0s and domUs (PV) running for multiple years without incident, domUs have slowly been added with time
-we have many additional nearly identical production servers with multiple domUs on each with very similar setups (same motherboard, CPU, RAM, OS, etc.) that have had no clock issues yet
-the jumped domU clock can be corrected by running a "date -s" command with any value, which then syncs the domU clock back up with the dom0
-we don't use live migration or saving/restoring, and no maintenance was taking place anywhere near or during the time of the jumps
-for all dom0s & domUs: independent_wallclock=0, ntpd is running, clocksource=jiffies, Xen version 3.1.2
-incident #1: ~Sun Jan 13 13:31:01 CST 2013 to Sun Mar 6 04:39:20 CST 2264 | dom0=CentOS 5.8, Linux 2.6.18-308.20.1.el5xen | domU=CentOS 5.5, Linux 2.6.18-194.8.1.el5xen
-incident #2: ~Thu Mar 28 11:54:22 CDT 2013 to Thu May 19 07:32:28 CST 2264 | dom0=CentOS 5.5, Linux 2.6.18-194.11.3.el5xen | domU=CentOS 5.8, Linux 2.6.18-308.24.1.el5xen
-incident #3: ~Sun Mar 31 10:42:14 CDT 2013 to Wed Apr 3 14:28:31 CDT 2013 | dom versions same as #2
-dom0 specs: TYAN S5397 w/ latest BIOS v1.07, guest count=6/3, DDR ECC RAM=48/64GB, 2 x Xeon E5420, LSI/Adaptec RAID, ~4 years old

*We already do or have now done the following:*
-full monitoring/logging for memory, disk, RAID, CPU, temperature, clock, log watch etc. (nothing bad to report)
-enabled XEND and XENSTORED debugging (since the last failure, to provide more info for potential future jumps)
-ran MemTest for hours under increased heat conditions and a minor "stresstest" run, no errors reported, fsck passed as well
-visual inspection of the hardware (no corrosion, matched CPUs, identical properly slotted RAM, etc.)
-full dom0 & domU updates to CentOS 5.9, disabled ntpd on domU, kept domU independent_wallclock=0

We have found no references to the same jump & stop clock issue on a domU given our circumstances. From other clock issue discussions, it appears that our root issue is probably the jump itself, and the clock stopping behavior is probably just the domU waiting for the dom0 time to catch up.

We initially thought, and were advised, that bad hardware could be to blame, but that may not be true given that the exact same issue surfaced on very similar but separate hardware and that the dom0 and other resident domUs were totally unaffected clock-wise.

With all independent_wallclock=0 (i.e. dependent), we know NTP does not need to be running in the domU because it's getting its clock from the dom0, but we run NTP anyway in the domU to aid in our monitoring of the domU clock. It should not matter, because nothing on the domU can set the clock when independent_wallclock=0.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
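For reference, the size of each reported jump can be worked out directly from the timestamps in the incident list. The sketch below is only arithmetic on the reported values and makes one assumption that is not in the report: the zone labels are treated as fixed offsets (CST = UTC-6, CDT = UTC-5), including the CST label printed on the May 2264 date. The "before" times are marked approximate in the report, so the results are approximate too. The names `INCIDENTS` and `jump_sizes` are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the zone labels in the report are fixed UTC offsets.
CST = timezone(timedelta(hours=-6), "CST")
CDT = timezone(timedelta(hours=-5), "CDT")

# incident number -> (approximate time before the jump, time the clock stopped at)
INCIDENTS = {
    1: (datetime(2013, 1, 13, 13, 31, 1, tzinfo=CST),
        datetime(2264, 3, 6, 4, 39, 20, tzinfo=CST)),
    2: (datetime(2013, 3, 28, 11, 54, 22, tzinfo=CDT),
        datetime(2264, 5, 19, 7, 32, 28, tzinfo=CST)),
    3: (datetime(2013, 3, 31, 10, 42, 14, tzinfo=CDT),
        datetime(2013, 4, 3, 14, 28, 31, tzinfo=CDT)),
}


def jump_sizes(incidents):
    """Return {incident number: timedelta between the two reported times}."""
    return {n: after - before for n, (before, after) in incidents.items()}


if __name__ == "__main__":
    for n, delta in jump_sizes(INCIDENTS).items():
        print(f"incident #{n}: {delta} ({delta.total_seconds():.0f} s)")
```

Under those assumptions, incidents #1 and #2 both come out at a little over 91,727 days, within a few hours of each other, while incident #3 is a little over 3 days.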
Ian Campbell
2013-Apr-23 09:12 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
On Mon, 2013-04-22 at 23:50 +0100, Tom W wrote:
> Hello Xen Developers! After fully researching ourselves and talking to
> many Xen consultants, we have been advised to inquire here about a
> rare Xen bug we are possibly experiencing. Any help or advice would be
> much appreciated, thanks in advance! We're also open to offering some
> financial support to solve this problem.

Does your hypervisor tree have this commit in it:

    commit 84628ee52a427b0f0fe50502eb8ffd0eedad0f03
    Author: Jan Beulich <jbeulich@suse.com>
    Date: Mon Nov 26 17:20:39 2012 +0100

        x86/time: fix scale_delta() inline assembly

That was responsible for a rash of strange time jumps, although IIRC it affected the whole system and not individual VMs.

It might be worth looking at the scale_delta function in your kernel, which I think you will find in arch/i386/kernel/time-xen.c. There was a fix made to this code in the upstream kernel which may be missing there:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de2d1a524e94a79078d9fe22c57c0c6009237547

I have no idea if this fix is relevant to the kernel (or compiler etc) you are using, but it looks interesting...

> Here is a summary of the problem:
> -very infrequently the domU clock is instantly jumping ahead a massive
> amount of time and then appearing to lock on the new time (i.e. time
> stops)

Some kernels (I expect including yours) contain a "latch" so that time always appears monotonic, which means that if time glitches forwards and then back again it will appear to lock at the later time. Look for monotonic in arch/i386/kernel/time-xen.c for the code.

If you were able to add some debugging to the kernel you should be able to observe this latching. In fact, a single-shot debug print when the latched time is way ahead of the current time would be a useful diagnostic tool IMHO.

> -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> clocksource=jiffies, Xen version 3.1.2

I know there are a lot of suggestions to set clocksource=jiffies floating around on the Internet, but I am far from convinced that it is a good idea. I won't rule out it being a useful workaround for kernel+hypervisors of the vintage you are using, but I think it would be interesting to try without it.

You've already noticed that independent_wallclock=0 and ntpd are inconsistent, so that's good.

Ian.
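The latching behaviour described above, together with the suggested single-shot debug print, can be illustrated with a toy model. This is a sketch only and is not the code in arch/i386/kernel/time-xen.c: the class `MonotonicLatch`, the helper `replay`, the one-second warning threshold and the sample readings are all invented for the illustration.

```python
class MonotonicLatch:
    """Toy 'never go backwards' clock: returns the largest raw value seen so far."""

    def __init__(self, warn_gap_ns=1_000_000_000):
        self.latched_ns = 0
        self.warn_gap_ns = warn_gap_ns  # hypothetical threshold for the debug print
        self.warned = False
        self.messages = []

    def read(self, raw_ns):
        if raw_ns >= self.latched_ns:
            # Normal case: time moved forwards, so the latch follows it.
            self.latched_ns = raw_ns
        elif not self.warned and self.latched_ns - raw_ns > self.warn_gap_ns:
            # Raw time is far behind the latch: report it once (single shot).
            self.warned = True
            self.messages.append(
                f"latched time is {self.latched_ns - raw_ns} ns ahead of raw time"
            )
        return self.latched_ns


def replay(raw_readings, latch=None):
    """Feed raw readings through a latch; return (values seen by callers, messages)."""
    if latch is None:
        latch = MonotonicLatch()
    return [latch.read(r) for r in raw_readings], latch.messages


if __name__ == "__main__":
    glitch = 7_925_267_299 * 10**9  # one bogus reading far in the future
    seen, messages = replay([1_000, 2_000, glitch, 3_000, 4_000])
    print(seen)
    print(messages)
```

In this model a single bad reading is enough to make every later read return the glitched value until the raw clock catches up, which matches the "jump, then stop" symptom, and the message is recorded only once however many reads follow.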
Jan Beulich
2013-Apr-23 09:29 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
>>> On 23.04.13 at 00:50, Tom W <tcte.tech@gmail.com> wrote:
> -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> clocksource=jiffies, Xen version 3.1.2

I can only second Ian's recommendation to drop this clocksource option.

And unless this was a typo, you surely want to get off that really old hypervisor. Nobody's going to help you with issues there, if they're not reproducible on recent Xen. Even if it was meant to read 4.1.2, you should update to (or at least check against) 4.1.4 or 4.1.5-rc before claiming to have an unsolved problem.

Jan
Ian Campbell
2013-Apr-23 09:33 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
On Tue, 2013-04-23 at 10:29 +0100, Jan Beulich wrote:
> >>> On 23.04.13 at 00:50, Tom W <tcte.tech@gmail.com> wrote:
> > -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> > clocksource=jiffies, Xen version 3.1.2
>
> I can only second Ian's recommendation to drop this clocksource
> option.
>
> And unless this was a typo, you surely want to get off that really
> old hypervisor.

FWIW I had assumed this was the RHEL5/CentOS5 supplied hypervisor (it's the right era at least) and not a typo.

> Nobody's going to help you with issues there,

If I'm right then it would be better to start by reporting a RHEL bug IMHO.

Ian.
Tom W
2013-Apr-23 17:50 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
Thanks for the feedback Ian and Jan!

It was not a typo. We are using RHEL5/CentOS5, which started in 2007 and is not fully EOL until 2020, although the production phase ends in 2017. We do understand your point about getting on a much newer version, but unfortunately we are a small operation and that type of change in the short term for our existing systems is very cost prohibitive.

"jiffies" is the only clock source option in all our dom0s and domUs as per the following output, so switching sources does not appear to be an option:

    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    jiffies

Are you thinking that's a potential sign of something being off, if the dom0 only has the one jiffies option? We're using the default CentOS install and have no special boot settings related to timing or the clock.

We had previously read Jan's "fix scale_delta() inline assembly" thread, but based on that discussion and all related threads we didn't think it really applied to our situation; perhaps it does. As well, our jumps appear to be much larger and far less frequent than those others have reported. We will figure out how to check whether that change is included in our tree and get back to you on what we find.

The latching clock behavior seems appropriate given the situation, but it seems potentially odd that the clock can then be fixed by simply issuing a "date -s" command on the domU when independent_wallclock=0. Should it not stay latched on the future date?

We shall also try the suggested RHEL bug submission path and see where that leads, thanks.

If we're stuck for the short/medium term on the latest CentOS 5 release with clocksource=jiffies, would switching our domU systems to independent_wallclock=1 and continuing to run ntpd have any better chance of bypassing the potential issue causing the jump, or is it possible it could make things worse?
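The settings discussed above can be collected from each dom0 and domU in one step, which makes it easier to compare hosts. This is a sketch under assumptions: the sysfs directory is the one quoted in the message, the `current_clocksource` file is assumed to sit alongside `available_clocksource`, and `/proc/sys/xen/independent_wallclock` is assumed to be where these kernels expose that setting. The function names are invented, and any file that is missing is reported as `None` rather than treated as an error.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")
# Assumption: location of the wallclock setting on the kernels in question.
WALLCLOCK_PATH = Path("/proc/sys/xen/independent_wallclock")


def read_words(path):
    """Return the whitespace-separated tokens of a small sysfs/procfs file, or None."""
    try:
        return Path(path).read_text().split()
    except OSError:
        return None


def clock_settings(clocksource_dir=CLOCKSOURCE_DIR, wallclock_path=WALLCLOCK_PATH):
    """Collect the clock-related settings of the machine this runs on."""
    clocksource_dir = Path(clocksource_dir)
    return {
        "available_clocksource": read_words(clocksource_dir / "available_clocksource"),
        "current_clocksource": read_words(clocksource_dir / "current_clocksource"),
        "independent_wallclock": read_words(wallclock_path),
    }


if __name__ == "__main__":
    for name, value in clock_settings().items():
        print(f"{name}: {value}")
```

Both paths are parameters so the function can be pointed at copies of the files gathered from several machines.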
Tom W
2013-Apr-24 01:26 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
> Does your hypervisor tree have this commit in it:
>
> commit 84628ee52a427b0f0fe50502eb8ffd0eedad0f03
> Author: Jan Beulich <jbeulich@suse.com>
> Date: Mon Nov 26 17:20:39 2012 +0100
>
> x86/time: fix scale_delta() inline assembly
>
> That was responsible for a rash of strange time jumps, although IIRC it
> affected the whole system and not individual VMs.
>
> It might be worth looking at the scale_delta function in your kernel,
> which I think you will find in arch/i386/kernel/time-xen.c. There was a
> fix made to this code in the upstream kernel which may be missing there:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de2d1a524e94a79078d9fe22c57c0c6009237547

After looking at the latest source for RHEL5, the scale_delta method does not have this change, nor does it have the fix Jan described here:

http://markmail.org/message/cngzubj6b6vdo55a

The latest RHEL6 does however have the change you described above.

Would changing independent_wallclock=1 bypass the need for the domU system to call this potentially bad scale_delta method in RHEL5?

Thanks Ian.
Ian Campbell
2013-Apr-24 09:00 UTC
Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
On Wed, 2013-04-24 at 02:26 +0100, Tom W wrote:
> Would changing independent_wallclock=1 bypass the need for the domU
> system to call this potentially bad scale_delta method in RHEL5?

In principle the system time (which uses scale_delta) and the wallclock time (which independent_wallclock controls) are separate things; however, your use of clocksource=jiffies and such an old kernel makes me unsure whether or not they are intertwined in your environment.

The best way to know for sure would be to rebuild your kernel with some debugging added and test that.

Ian.
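To make the role of scale_delta in the system time concrete, here is a plain-Python model of the calculation as it is generally described for Xen PV guests: a TSC delta is shifted, multiplied by a 32.32 fixed-point fraction, and added to a base system time. It is a sketch under assumptions and not the RHEL5 kernel's inline assembly, which is exactly the part the commits mentioned earlier in the thread touch. The parameter names follow the fields of Xen's per-vCPU time info as I understand them, and the 2 GHz TSC figures are made up for the example.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned arithmetic


def scale_delta(delta, mul_frac, shift):
    """Scale a TSC tick delta to nanoseconds.

    delta is shifted by `shift` (right if negative), then multiplied by
    mul_frac / 2**32; the result is truncated to 64 bits.
    """
    if shift < 0:
        delta >>= -shift
    else:
        delta = (delta << shift) & MASK64
    return ((delta * mul_frac) >> 32) & MASK64


def system_time_ns(tsc_now, tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift):
    """System time = base time plus the scaled number of ticks since the base was taken."""
    delta = (tsc_now - tsc_timestamp) & MASK64
    return system_time + scale_delta(delta, tsc_to_system_mul, tsc_shift)


if __name__ == "__main__":
    # Hypothetical 2 GHz TSC: 0.5 ns per tick, i.e. mul_frac = 0.5 * 2**32.
    half = 0x80000000
    print(scale_delta(2_000_000_000, half, 0))  # ticks in one second at 2 GHz
    print(system_time_ns(3_000_000_000, 1_000_000_000, 500, half, 0))
```

The wallclock is not part of this calculation, which is the separation referred to above; whether the two stay separate with clocksource=jiffies on this kernel is the open question.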