Rik van Riel
2008-Aug-06 20:46 UTC
[Xen-devel] [BUG 1282] time jump on live migrate root cause & proposed fixes
Hi,

I have done some debugging to find out the root cause of bug 1282, which
has the following symptoms with paravirtualized guests:
- after a live migrate, the time on the guest can jump
- after a live migrate, the guest "forgets" to wake up processes
- after a domU save, dom0 reboot and domU restore, the time is
  correct but processes are not woken up from sys_nanosleep

The problem seems to stem from the fact that domU uses the hypervisor's
system_time, which is the time since hypervisor system bootup in
nanoseconds, as its base for timekeeping.

This works fine as long as the guest stays on the same hypervisor,
but if the guest is migrated to a hypervisor with a different uptime,
problems ensue. Specifically, if the guest is migrated to a host
with a lower uptime, processes that call sys_nanosleep() will not
be woken up until the new host's uptime catches up with the uptime
of the old host! While waiting for the uptime to catch up,
gettimeofday always returns the same value.

Conversely, if a guest migrates from a host with a lower uptime to
a host with a higher uptime, the system time in the guest advances
by the difference between the two uptimes.

I can think of a few possible fixes for this issue:

1) have system_time in the hypervisor start at unix epoch 0
   (january 1st 1970) instead of at boot time - this may
   require some magic to sync_cmos_clock(), sync_xen_wallclock()
   and/or other functions so dom0 does not get too confused while
   changing the time during bootup

2) have time_init() and time_resume() calculate the hypervisor
   boot time from the shared_info ->wc_sec ->wc_nsec and the
   shared_info->per cpu vcpu_info->system_time -- if the host
   boot time changes (by more than a second?) adjust some local
   offset that we add into get_nsec_offset() and get_usec_offset()
   to always adjust the time right

3) get_time_values_from_xen() and __update_wallclock() can keep
   track of such an offset by themselves

Does anybody have comments on the ideas above, or maybe even
better ideas on how to fix the problem? :)

--
All Rights Reversed

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
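To make the failure mode in the message above concrete, here is a minimal C sketch of a sleeper whose wake-up deadline is stored as an absolute value of the hypervisor's system_time. It is a toy model under stated assumptions, not the guest kernel's code: `struct sleeper`, `arm_sleep()`, `has_expired()` and `remaining_ns()` are hypothetical names invented for illustration, and the uptimes used with it are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical model of a sleeping process: the wake-up deadline is an
 * absolute value of the hypervisor's system_time (ns since host boot). */
struct sleeper {
    uint64_t expires_ns;
};

/* Arm a sleep of delay_ns, measured against the current host's uptime. */
struct sleeper arm_sleep(uint64_t system_time_ns, uint64_t delay_ns)
{
    struct sleeper s = { .expires_ns = system_time_ns + delay_ns };
    return s;
}

/* The sleeper is woken once the host's system_time reaches the deadline. */
bool has_expired(const struct sleeper *s, uint64_t system_time_ns)
{
    return system_time_ns >= s->expires_ns;
}

/* How long the sleeper still has to wait, as seen from a given host. */
uint64_t remaining_ns(const struct sleeper *s, uint64_t system_time_ns)
{
    return has_expired(s, system_time_ns) ? 0 : s->expires_ns - system_time_ns;
}
```

For example, a 1 s sleep armed at an uptime of 1000 s gets a deadline of 1001 s. Seen from a destination host whose uptime is only 102 s, `remaining_ns()` reports 899 s still to wait, which is the "forgets to wake up processes" symptom.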
Dan Magenheimer
2008-Aug-06 21:45 UTC
RE: [Xen-devel] [BUG 1282] time jump on live migrate root cause & proposed fixes
An idea I've been considering to handle live migration between
"good tsc" and "bad tsc" machines may work nicely here too:

1) Implement softtsc* on PV domains (per-domain rather than global)
2) In live migrate startup, force-switch domain to scaled softtsc
3) When control transfers to new machine, rescale softtsc to
   tsc rate on target hypervisor
4) In live migrate cleanup, switch domain off of softtsc (or
   not, if Xen determines the tsc's on target are untrustworthy)

I'd also like to see softtsc controllable via sysfs.

Dan

* softtsc means trap all tsc reads/writes and emulate them in
  the hypervisor. This works (in 4.0) for hvm domains and tsc
  reads on my machine go from ~80ns to ~3us... not great
  but not horrible.
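Step 3 of the softtsc idea (rescaling to the target's tsc rate) can be sketched as follows. This is only an illustration of one way such scaling could work, not code from Xen: `struct softtsc`, `scale_delta()`, `softtsc_read()` and `softtsc_rebase()` are hypothetical names, and the assumption is that the emulated tsc keeps ticking at the rate the guest first saw while the physical tsc rate changes across hosts.

```c
#include <stdint.h>

/* Hypothetical per-domain state for an emulated ("soft") TSC. */
struct softtsc {
    uint64_t guest_base; /* guest-visible TSC value at the last rebase */
    uint64_t host_base;  /* host TSC value at the last rebase */
    uint32_t guest_khz;  /* rate the guest believes its TSC runs at */
    uint32_t host_khz;   /* rate of the physical TSC on the current host */
};

/* Compute floor(delta * mul / div). Splitting delta into quotient and
 * remainder keeps the intermediate products within 64 bits. */
uint64_t scale_delta(uint64_t delta, uint32_t mul, uint32_t div)
{
    return (delta / div) * mul + (delta % div) * mul / div;
}

/* Value handed back to the guest on a trapped TSC read. */
uint64_t softtsc_read(const struct softtsc *s, uint64_t host_tsc)
{
    return s->guest_base + scale_delta(host_tsc - s->host_base,
                                       s->guest_khz, s->host_khz);
}

/* On arrival at the target host: keep the guest-visible value continuous
 * and rescale against the new host's TSC rate. */
void softtsc_rebase(struct softtsc *s, uint64_t guest_tsc_now,
                    uint64_t new_host_tsc, uint32_t new_host_khz)
{
    s->guest_base = guest_tsc_now;
    s->host_base = new_host_tsc;
    s->host_khz = new_host_khz;
}
```

With a guest that started on a 2 GHz host and moved to a 3 GHz host, one second on the target (3,000,000,000 host cycles) advances the guest-visible tsc by 2,000,000,000, so the guest sees neither a jump nor a rate change.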
Rik van Riel
2008-Aug-07 02:00 UTC
[Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On Wed, 6 Aug 2008 16:46:57 -0400
Rik van Riel <riel@redhat.com> wrote:

> I have done some debugging to find out the root cause of bug 1282, which
> has the following symptoms with paravirtualized guests:
> - after a live migrate, the time on the guest can jump
> - after a live migrate, the guest "forgets" to wake up processes
> - after a domU save, dom0 reboot and domU restore, the time is
>   correct but processes are not woken up from sys_nanosleep
>
> The problem seems to stem from the fact that domU uses the hypervisor's
> system_time, which is the time since hypervisor system bootup in
> nanoseconds, as its base for timekeeping.

I've been reading the code some more, and it appears to be
even stranger than I imagined :(

Setting the time in dom0, through do_settimeofday() or
sync_xen_wallclock() ends up calling a hypervisor function
do_settime(sec, nsec, system_timestamp), which ends up
subtracting system_timestamp (HV uptime in nsecs) from
the given time, setting the variables wc_sec and wc_nsec
in arch/x86/time.c to (now - HV uptime).

This effectively means that a settimeofday in dom0 will
redefine the time at which the hypervisor booted up.

It also means that time_resume() would theoretically do
the right thing, if run on cpu0, which I assume it does.
It sets the local system's system_timestamp (HV uptime
in nsecs) as well as shadow.tv_sec and shadow.tv_nsec,
which reflect the hypervisor's boot time.

This really makes me wonder why the guest is getting its
clock messed up by the difference of system uptimes when
live migrating from one system to another, between two
hosts that are NTP synced.

The reason is that wc_sec + wc_nsec + system_timestamp
should always be the same across multiple systems, since
this equals system boot time + uptime.

Does anybody know why a save/restore or a live migrate
would mess things up?

--
All rights reversed.
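The arithmetic described in the message above can be modelled in a few lines of C. This is a simplified userspace sketch, not the hypervisor's do_settime(): `struct wallclock_base`, `model_settime()` and `wall_now_ns()` are hypothetical names, and the example values used with it are invented. It shows why two NTP-synced hosts with different uptimes publish different wc_sec/wc_nsec values yet agree on the current wall-clock time.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Wall-clock base as published by the hypervisor: the wall-clock time
 * that corresponds to system_time == 0, i.e. the apparent boot time. */
struct wallclock_base {
    uint32_t wc_sec;
    uint32_t wc_nsec;
};

/* Model of the arithmetic described above: subtract the hypervisor
 * uptime from the requested wall-clock time and store the result. */
struct wallclock_base model_settime(uint64_t sec, uint64_t nsec,
                                    uint64_t system_time_ns)
{
    uint64_t boot_ns = sec * NSEC_PER_SEC + nsec - system_time_ns;
    struct wallclock_base wc = {
        .wc_sec = (uint32_t)(boot_ns / NSEC_PER_SEC),
        .wc_nsec = (uint32_t)(boot_ns % NSEC_PER_SEC),
    };
    return wc;
}

/* Current wall-clock time in ns: apparent boot time plus uptime. */
uint64_t wall_now_ns(struct wallclock_base wc, uint64_t system_time_ns)
{
    return wc.wc_sec * NSEC_PER_SEC + wc.wc_nsec + system_time_ns;
}
```

If both hosts are set to the same wall-clock time, a host with 1000 s of uptime and a host with 100 s of uptime end up with wc_sec values that differ by 900, while `wall_now_ns()` returns the same result on both. A guest that picks up the destination's wc_sec, wc_nsec and system_time together should therefore see no jump.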
Tian, Kevin
2008-Aug-07 02:01 UTC
RE: [Xen-devel] [BUG 1282] time jump on live migrate root cause & proposed fixes
>From: Rik van Riel
>Sent: August 7, 2008 4:47
>
>I can think of a few possible fixes for this issue:
>
>1) have system_time in the hypervisor start at unix epoch 0
>   (january 1st 1970) instead of at boot time - this may
>   require some magic to sync_cmos_clock(), sync_xen_wallclock()
>   and/or other functions so dom0 does not get too confused while
>   changing the time during bootup

This looks good; my only concern is compatibility. Would fixing a
special case like migration cause messed-up time in all old domUs,
even when they are running normally?

>2) have time_init() and time_resume() calculate the hypervisor
>   boot time from the shared_info ->wc_sec ->wc_nsec and the
>   shared_info->per cpu vcpu_info->system_time -- if the host
>   boot time changes (by more than a second?) adjust some local
>   offset that we add into get_nsec_offset() and get_usec_offset()
>   to always adjust the time right

There may be an issue. Although Xen implements system_time as the
time elapsed since boot, wc_sec is not a static value stamped
at boot time. Whenever Xen is asked to change the wall clock
(do_settime), for whatever reason, it adjusts wc_sec while keeping
system_time intact (wc_sec + system_time = wall clock). That is,
even on the same machine shared_info->wc_xxx can change.
So it would be difficult for domU to calculate the boot time directly
from the shared info page.

The key point is that Xen simply aligns every domU's system time
to Xen's system time, regardless of when the domU was created. One
alternative is then to introduce a per-domain system time concept,
like this:

- At time_init, reset processed_system_time to zero and record the
  offset to Xen's system time. Some interfaces need small changes
  to understand this offset.

- Add a time_suspend, which saves the wall clock time (xtime?) at
  that moment.

- Then in time_resume, retrieve the current wall clock time from the
  shared info page and compare it to the value recorded at suspend
  time. The delta is then applied to processed_system_time, with the
  offset adjusted according to the new wc_xxx and Xen system_time.

One caution is time zone differences, which can however be
compensated for by the control panel to ensure that the comparison
in time_resume is meaningful.

Thanks,
Kevin
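The per-domain system time idea above can be sketched in C as follows. This is a rough model under stated assumptions rather than a patch: `struct domain_clock` and the `dom_clock_*` functions are hypothetical names standing in for the time_init, time_suspend and time_resume steps, and all times are plain signed nanosecond counts.

```c
#include <stdint.h>

/* Hypothetical per-domain clock state for the scheme outlined above. */
struct domain_clock {
    int64_t offset_ns;       /* Xen system_time minus per-domain time */
    int64_t suspend_wall_ns; /* wall clock recorded at suspend */
    int64_t suspend_dom_ns;  /* per-domain time recorded at suspend */
};

/* Per-domain system time: Xen's system_time with the offset removed. */
int64_t dom_clock_now(const struct domain_clock *c, int64_t xen_system_time_ns)
{
    return xen_system_time_ns - c->offset_ns;
}

/* time_init step: per-domain time starts at zero, whatever the host uptime. */
void dom_clock_init(struct domain_clock *c, int64_t xen_system_time_ns)
{
    c->offset_ns = xen_system_time_ns;
    c->suspend_wall_ns = 0;
    c->suspend_dom_ns = 0;
}

/* time_suspend step: remember the wall clock and the per-domain time. */
void dom_clock_suspend(struct domain_clock *c, int64_t xen_system_time_ns,
                       int64_t wall_ns)
{
    c->suspend_wall_ns = wall_ns;
    c->suspend_dom_ns = dom_clock_now(c, xen_system_time_ns);
}

/* time_resume step: advance per-domain time by the wall-clock delta and
 * recompute the offset against the (possibly different) host. */
void dom_clock_resume(struct domain_clock *c, int64_t xen_system_time_ns,
                      int64_t wall_ns)
{
    int64_t dom_now = c->suspend_dom_ns + (wall_ns - c->suspend_wall_ns);
    c->offset_ns = xen_system_time_ns - dom_now;
}
```

A domain created at host uptime 1000 s, suspended 50 s later and resumed 2 s of wall-clock time afterwards on a host with uptime 100 s would continue at a per-domain time of 52 s, independent of either host's uptime.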
Tian, Kevin
2008-Aug-07 02:22 UTC
RE: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
>From: Rik van Riel
>Sent: August 7, 2008 10:01
>
>This really makes me wonder why the guest is getting its
>clock messed up by the difference of system uptimes when
>live migrating from one system to another, between two
>hosts that are NTP synced.
>
>The reason is that wc_sec + wc_nsec + system_timestamp
>should always be the same across multiple systems, since
>this equals system boot time + uptime.
>
>Does anybody know why a save/restore or a live migrate
>would mess things up?

Please forget the previous mail I just sent out. As you said, the
existing implementation looks correct in principle, unless uptime is
unexpectedly stamped into some other components, such as timer-related
structures? :-(

Thanks,
Kevin
Tian, Kevin
2008-Aug-07 02:23 UTC
RE: [Xen-devel] [BUG 1282] time jump on live migrate root cause & proposed fixes
Please forget this one, since the original logic looks correct.

Thanks,
Kevin

>From: Tian, Kevin
>Sent: August 7, 2008 10:01
Keir Fraser
2008-Aug-07 06:57 UTC
Re: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On 7/8/08 03:00, "Rik van Riel" <riel@redhat.com> wrote:

> The reason is that wc_sec + wc_nsec + system_timestamp
> should always be the same across multiple systems, since
> this equals system boot time + uptime.
>
> Does anybody know why a save/restore or a live migrate
> would mess things up?

I'll have a poke around the code today.

 -- Keir
Keir Fraser
2008-Aug-07 10:25 UTC
Re: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On 7/8/08 07:57, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> On 7/8/08 03:00, "Rik van Riel" <riel@redhat.com> wrote:
>
>> The reason is that wc_sec + wc_nsec + system_timestamp
>> should always be the same across multiple systems, since
>> this equals system boot time + uptime.
>>
>> Does anybody know why a save/restore or a live migrate
>> would mess things up?
>
> I'll have a poke around the code today.

It does look like the code should work. What kernel specifically have you
reproduced this behaviour with?

 -- Keir
Rik van Riel
2008-Aug-07 13:06 UTC
Re: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On Thu, 07 Aug 2008 11:25:34 +0100
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> It does look like the code should work.

Yeah, it looks like it should. Unfortunately it doesn't :)

> What kernel specifically have you reproduced this behaviour with?

The current RHEL 5 kernel (2.6.18-92.1.6.el5xen).

I have instrumented time_resume and am seeing something very
strange. I have two systemtap hooks, one when entering the
function and one when returning from the function. This output
contains the trace info from two live migrates.

As you can see, time_resume() manages to get a new system_timestamp
from each host just fine, through get_time_values_from_xen().

However, the data that update_wallclock() retrieves
from the HYPERVISOR_shared_info does not change. Even
wc_version stays the same across both host systems!

# ./time_resume.stap
entering time_resume, secs = 1217875833, nsecs = 869795597
    system_time = 237323812007279, shadow_tv_version = 350
leaving time_resume, secs = 1217875833, nsecs = 869795597
    system_time = 604085945766733, shadow_tv_version = 350
entering time_resume, secs = 1217875833, nsecs = 869795597
    system_time = 604262961766733, shadow_tv_version = 350
leaving time_resume, secs = 1217875833, nsecs = 869795597
    system_time = 237501337935436, shadow_tv_version = 350

--
All rights reversed.
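The check that the trace above performs by eye can be written down as a small C sketch. It is a userspace model, not the guest kernel's update_wallclock(): `struct shared_wallclock`, `struct shadow_wallclock`, `snapshot_wallclock()` and `wallclock_looks_stale()` are hypothetical names, the kernel would use its own read barriers instead of C11 fences, and the convention that an odd wc_version marks an update in progress is an assumption based on the usual version-counter pattern.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the wall-clock fields of the shared info page. */
struct shared_wallclock {
    volatile uint32_t wc_version; /* assumed odd while an update is in flight */
    volatile uint32_t wc_sec;
    volatile uint32_t wc_nsec;
};

/* Guest-side copy, matching the fields printed in the trace. */
struct shadow_wallclock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
};

/* Take a consistent snapshot: retry while an update is in flight or the
 * version changed while the fields were being read. */
void snapshot_wallclock(const struct shared_wallclock *s,
                        struct shadow_wallclock *out)
{
    do {
        out->version = s->wc_version;
        atomic_thread_fence(memory_order_acquire);
        out->sec = s->wc_sec;
        out->nsec = s->wc_nsec;
        atomic_thread_fence(memory_order_acquire);
    } while ((out->version & 1) || out->version != s->wc_version);
}

/* After a migration the destination host is expected to publish its own
 * values, so an identical snapshot hints that the page is stale. */
bool wallclock_looks_stale(const struct shadow_wallclock *before,
                           const struct shadow_wallclock *after)
{
    return before->version == after->version &&
           before->sec == after->sec &&
           before->nsec == after->nsec;
}
```

Applied to the trace, the snapshots taken on entry to and exit from time_resume() are identical (version 350, same secs and nsecs) even though system_time comes from a different host, so `wallclock_looks_stale()` would return true for both migrations.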
Keir Fraser
2008-Aug-07 13:44 UTC
Re: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On 7/8/08 14:06, "Rik van Riel" <riel@redhat.com> wrote:

> However, the data that update_wallclock() retrieves
> from the HYPERVISOR_shared_info does not change. Even
> wc_version stays the same across both host systems!

Ah, it's an issue in your libxenguest (or equivalent) most likely. See
xen-unstable.hg:15706. It's been fixed since 3.1.1.

 -- Keir
Rik van Riel
2008-Aug-07 15:38 UTC
Re: [Xen-devel] Re: [BUG 1282] time jump on live migrate root cause & proposed fixes
On Thu, 07 Aug 2008 14:44:57 +0100
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 7/8/08 14:06, "Rik van Riel" <riel@redhat.com> wrote:
>
> > However, the data that update_wallclock() retrieves
> > from the HYPERVISOR_shared_info does not change. Even
> > wc_version stays the same across both host systems!
>
> Ah, it's an issue in your libxenguest (or equivalent) most likely. See
> xen-unstable.hg:15706. It's been fixed since 3.1.1.

This did the trick! Thank you.

I guess I'll close bug 1282 now :)

--
All rights reversed.