thr3ads.net - Xen users - Clock problems running RHEL6.3 PV guest [Feb 2013]

Mark Thebridge

2013-Feb-11 14:45 UTC

Clock problems running RHEL6.3 PV guest

Hi,

I have a reasonably time-critical networking application that I'm
trying to get running in a Xen PV guest. Unfortunately, I'm
experiencing intermittent lockups that seem to be caused by poor
timekeeping in the domU.

The application runs on Red Hat Enterprise Linux 6.3, so I'm using
that as the domU. For dom0 I've tried both CentOS 5.9 with Xen 3.1
and Fedora 18 with Xen 4.2.1. Both show the same problem.

The problem manifests as what seem to be lockups: a single vCPU appears to
hang. My application has internal monitoring threads that try to determine
whether any part of the application has hung, and these are constantly
triggering erroneously. If I have a shell open to the guest, it sometimes
becomes unresponsive for a second or two. And very occasionally (maybe 3 or 4
times a day?) the kernel reports soft lockups of around 25 seconds, always
with the following stack:

<IRQ>  [<ffffffff810d8392>] ? watchdog_timer_fn+0x1c2/0x1d0
[<ffffffff810951be>] ? __run_hrtimer+0x8e/0x1a0
[<ffffffff81007c09>] ? xen_clocksource_get_cycles+0x9/0x10
[<ffffffff81095566>] ? hrtimer_interrupt+0xe6/0x250
[<ffffffff8109570f>] ? __hrtimer_peek_ahead_timers+0x3f/0x50
[<ffffffff81095744>] ? hrtimer_peek_ahead_timers+0x24/0x40
[<ffffffff8109579b>] ? run_hrtimer_softirq+0x3b/0x40
[<ffffffff810729cb>] ? __do_softirq+0xbb/0x1f0
[<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
<EOI>  [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
[<ffffffff81072530>] ? ksoftirqd+0x80/0x110
[<ffffffff810724b0>] ? ksoftirqd+0x0/0x110
[<ffffffff810906d6>] ? kthread+0x96/0xa0
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

I also get regular "clocksource tsc unstable" messages in the domU. If
I turn on ntpd in the domU, the clock drifts so quickly that NTP
can't compensate.
Note that the time in dom0 seems fine, and I've seen no issues there.
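
The one-to-two-second stalls described above could be measured from user space inside the guest. The following is only a minimal sketch, not part of the original report: it assumes Python 3.3 or later is available in the domU, and the function names sample_clock and find_stalls are made up for illustration.

```python
import time


def sample_clock(count, interval):
    """Record `count` monotonic timestamps, sleeping `interval` seconds between them."""
    samples = []
    for _ in range(count):
        samples.append(time.monotonic())
        time.sleep(interval)
    return samples


def find_stalls(samples, threshold):
    """Return (index, gap) pairs where the gap to the previous sample exceeds `threshold`."""
    stalls = []
    for i in range(1, len(samples)):
        gap = samples[i] - samples[i - 1]
        if gap > threshold:
            stalls.append((i, gap))
    return stalls
```

For example, find_stalls(sample_clock(600, 0.1), 0.5) samples for about a minute and lists every gap longer than half a second. Because the timestamps come from the guest's own clock, a reported gap cannot by itself distinguish a stalled vCPU from a clock that jumped forward; comparing against timestamps taken in dom0 would be needed for that.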

Has anyone seen anything similar? I know this application *can* run
virtualized: I have run it many times on VMware servers with no problem, and
it has been stable in Amazon EC2 as well. So I feel there must be
something odd about my hypervisor/dom0 setup, but I can't work out
what's wrong.

Other potentially useful information:
-- The underlying physical CPUs are fairly standard 64-bit Intel parts: Xeon E5645.
-- There are no other domUs running, and the dom0 is doing nothing unusual.
-- Other things I've tried, none of which seems to make any difference:
  -- Switching the guest clocksource from xen to tsc (the only two available)
  -- Changing the hypervisor command line to set clocksource=pit rather than HPET
  -- Booting the domU with a single vCPU
  -- Pinning, and not pinning, the vCPUs to fixed physical CPUs, for both dom0
and domU
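
For anyone reproducing the clocksource switch mentioned in the list above, the guest's active and available clocksources can be read from the standard Linux sysfs files. This is a minimal sketch rather than anything from the original report; the helper name read_clocksources is hypothetical, and the code sticks to calls that work on both Python 2 and Python 3.

```python
import os

# Standard Linux sysfs location for clocksource selection.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported under `base`."""
    with open(os.path.join(base, "current_clocksource")) as f:
        current = f.read().strip()
    with open(os.path.join(base, "available_clocksource")) as f:
        available = f.read().split()
    return current, available
```

In the guest described here, read_clocksources() would be expected to list xen and tsc as available. Writing one of those names to current_clocksource as root switches the clocksource at runtime.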

Thanks,
Mark
