1. Introduction

This patch improves the hpet-based guest clock in terms of drift and monotonicity. Prior to this work the drift with hpet was greater than 2%, far above the .05% limit for ntp to synchronize. With this code, the drift ranges from .001% to .0033%, depending on guest and physical platform.

Using hpet allows guest operating systems to provide monotonic time to their applications. Time sources other than hpet are not monotonic because of their reliance on the tsc, which is not synchronized across physical processors.

Windows 2k864 and many Linux guests are supported with two policies: one for guests that handle missed clock interrupts and the other for guests that require the correct number of interrupts.

Guests may use hpet as the timing source even if the physical platform has no visible hpet. Migration is supported between physical machines which differ in physical hpet visibility.

Most of the changes are in hpet.c. Two general facilities are added to track interrupt progress. The ideas and facilities here would also be useful in vpt.c, for other time sources, though no attempt is made here to improve vpt.c.

The following sections discuss hpet dependencies, interrupt delivery policies, live migration, test results, and the relation to recent work with monotonic time.

2. Virtual Hpet dependencies

The virtual hpet depends on the ability to read the physical or simulated (see discussion below) hpet. For timekeeping, the virtual hpet also depends on two new interrupt notification facilities to implement its policies for interrupt delivery.

2.1. Two modes of low-level hpet main counter reads.

In this implementation, the virtual hpet reads, with read_64_main_counter() exported by time.c, either the real physical hpet main counter register directly or a "simulated" hpet main counter.

The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last time value is returned whenever the current time value is less than the last time value. In simulated mode, since it is layered on s_time, the underlying hardware can be hpet or some other device. The frequency of the main counter in simulated mode is the same as the standard physical hpet frequency, allowing live migration between nodes that are configured differently.

If the physical platform does not have an hpet device, or if xen is configured not to use the device, then the simulated method is used. If there is a physical hpet device and xen has initialized it, then either simulated or physical mode can be used. This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the simulated mode and 0 the physical mode. The default is physical mode.

A disadvantage of the physical mode is that it may take longer to read the device than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for physical and simulated modes, while on others the physical cost is much higher than the simulated one. A disadvantage of the simulated mode is that it can return the same value for the counter in consecutive calls.
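For illustration, the simulated-mode read described above might look like the following minimal sketch (the physical branch, a direct register read, is omitted). SIM_HPET_HZ, the last-value variable, and the locking are assumptions of this sketch, not taken from the patch; only read_64_main_counter(), NOW(), and the monotonic-clamp behavior come from the text.

    /* Sketch of the simulated main-counter read -- not the patch code. */
    #define SIM_HPET_HZ 14318180UL   /* assumed "standard" hpet frequency */

    static uint64_t simulated_mc_last;
    static DEFINE_SPINLOCK(simulated_mc_lock);

    uint64_t read_64_main_counter(void)
    {
        uint64_t ticks;
        unsigned long flags;

        /* Scale Xen system time (ns since boot) to main-counter ticks. */
        ticks = muldiv64(NOW(), SIM_HPET_HZ, 1000000000UL);

        spin_lock_irqsave(&simulated_mc_lock, flags);
        /* Monotonic clamp: never return less than the last value returned. */
        if ( ticks < simulated_mc_last )
            ticks = simulated_mc_last;
        else
            simulated_mc_last = ticks;
        spin_unlock_irqrestore(&simulated_mc_lock, flags);

        return ticks;
    }

Because the clamp is global rather than per-vcpu, two back-to-back reads can legitimately return the same value, which is the simulated-mode disadvantage noted above.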
2.2. Interrupt notification facilities.

Two interrupt notification facilities are introduced: one is hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().

The vhpet uses hvm_isa_irq_assert_cb() to deliver interrupts to the vioapic. hvm_isa_irq_assert_cb() allows a callback to be passed along to vioapic_deliver(), and this callback is called with a mask of the vcpus which will get the interrupt. This callback is made before any vcpus receive the interrupt.

The vhpet uses hvm_register_intr_en_notif() to register a handler for a particular vector; the handler is called when that vector is injected in [vmx,svm]_intr_assist() and also when the guest finishes handling the interrupt. Here "finished" is defined as the point when the guest re-enables interrupts or lowers the tpr value. EOI is not used as the end of interrupt because it is sometimes issued before the interrupt handler has done its work. A flag is passed to the handler indicating whether this is the injection point (post = 1) or the interrupt-finished point (post = 0). The need for the finished-point callback is discussed in the missed ticks policy section.

To prevent a possible early trigger of the finished callback, the intr_en_notif logic has a two-stage arm: the first stage at injection (hvm_intr_en_notif_arm()) and the second when interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
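The prototypes are not spelled out in this note; the sketch below shows one plausible shape for the two facilities, consistent with the description above. The argument lists and types are assumptions; only the function names and the semantics of the post flag come from the patch description.

    /* Sketch only: argument lists are assumptions, not the patch's API. */

    /* Callback invoked by vioapic_deliver() with the mask of vcpus that
     * will receive the interrupt, before any vcpu actually sees it. */
    typedef void (*isa_irq_assert_cb_t)(struct domain *d, uint32_t vcpu_mask);

    /* Assert an ISA IRQ, passing the callback along to vioapic_deliver(). */
    void hvm_isa_irq_assert_cb(struct domain *d, unsigned int isa_irq,
                               isa_irq_assert_cb_t cb);

    /* Handler called twice per delivered interrupt: post = 1 at injection
     * in [vmx,svm]_intr_assist(), post = 0 once the guest has finished
     * handling it (re-enabled interrupts or lowered the tpr). */
    typedef void (*intr_en_notif_t)(struct vcpu *v, int post);

    /* Register the handler for one vector. */
    void hvm_register_intr_en_notif(struct domain *d, uint8_t vector,
                                    intr_en_notif_t handler);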
3. Interrupt delivery policies

The existing hpet interrupt delivery is preserved. This includes the vcpu round-robin delivery used by Linux and the broadcast delivery used by Windows.

There are two policies for interrupt delivery: one for Windows 2k864 and the other for Linux. The Linux policy takes advantage of the (guest) Linux missed-tick and offset calculations and does not attempt to deliver the right number of interrupts. The Windows policy delivers the correct number of interrupts, even if some are much closer to each other than the period. The policies are similar to those in vpt.c, though there are some important differences.

Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE. Two new values are added: HVM_HPET_guest_computes_missed_ticks and HVM_HPET_guest_does_not_compute_missed_ticks. The reason two new values are added is that some guests (32-bit Linux) need a no-missed policy for clock sources other than hpet and a missed ticks policy for hpet. It was felt that there would be less confusion by simply introducing the two hpet policies.

3.1. The missed ticks policy

The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet main counter. The algorithm works well when the time since the last interrupt is greater than or equal to a period, and poorly otherwise.

The missed ticks policy ensures that no two clock interrupts are delivered to the guest at a time interval less than a period. A time stamp (hpet main counter value) is recorded (by a callback registered with hvm_register_intr_en_notif()) when Linux finishes handling the clock interrupt. Ensuing interrupts are then delivered to the vioapic only if the current main counter value is at least a period greater than it was when the last interrupt was handled.

Tests showed a significant improvement in clock drift with end-of-interrupt time stamps versus beginning-of-interrupt time stamps [1]. It is believed that the reason for the improvement is that the clock interrupt handler contends for a spinlock and can therefore be delayed in its processing. Furthermore, the main counter is read by the guest under the lock. The net effect is that if we time stamp at injection, the difference in time between successive interrupt handler lock acquisitions can end up less than the period.

3.2. The no-missed ticks policy

Windows 2k864 keeps very poor time with the missed ticks policy, so the no-missed ticks policy was developed. Under the no-missed ticks policy we deliver the correct number of interrupts, even if they are spaced less than a period apart (when catching up).

Windows 2k864 uses a broadcast mode in the interrupt routing such that all vcpus get the clock interrupt. The best Windows drift performance was achieved when the policy code ensured that all the previous interrupts (on the various vcpus) had been injected before injecting the next interrupt to the vioapic.

The policy code works as follows; a code sketch follows at the end of this section. It uses hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered with hvm_register_intr_en_notif(), at post = 1 time it clears the current vcpu in the pending_mask. When the pending_mask becomes clear it decrements hpet.intr_pending_nr and, if intr_pending_nr is still non-zero, posts another interrupt to the ioapic with hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().

The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though Linux does not broadcast its interrupts, the code could handle it if it did. In this case the end-of-interrupt time stamp is taken when the pending_mask is clear.
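To make the two policies concrete, here is a sketch of the decision and callback logic just described. The names pending_mask, intr_pending_nr, and hpet_route_decision_not_missed_ticks() come from this note; the struct layout, the policy flag, domain_vhpet(), and all locking are assumptions, so treat this as illustrative C rather than the patch itself.

    /* Pre-delivery callback: record which vcpus will get this interrupt. */
    static void hpet_assert_cb(struct domain *d, uint32_t vcpu_mask)
    {
        domain_vhpet(d)->pending_mask = vcpu_mask;
    }

    /* 3.1: deliver only if a full period has passed since the guest last
     * *finished* handling a clock interrupt. */
    static void hpet_route_decision_missed_ticks(struct domain *d,
                                                 struct HPETState *h)
    {
        if ( read_64_main_counter() - h->mc_last_handled >= h->period )
            hvm_isa_irq_assert_cb(d, h->isa_irq, hpet_assert_cb);
    }

    /* 3.2: account for every tick; inject immediately only when no earlier
     * broadcast is still in flight. */
    static void hpet_route_decision_not_missed_ticks(struct domain *d,
                                                     struct HPETState *h)
    {
        if ( h->intr_pending_nr++ == 0 )
            hvm_isa_irq_assert_cb(d, h->isa_irq, hpet_assert_cb);
    }

    /* Registered via hvm_register_intr_en_notif(): post = 1 at injection,
     * post = 0 when the guest has finished handling the interrupt. */
    static void hpet_intr_en_notif(struct vcpu *v, int post)
    {
        struct HPETState *h = domain_vhpet(v->domain);

        if ( post == 1 )
            h->pending_mask &= ~(1u << v->vcpu_id);

        if ( h->pending_mask != 0 )
            return;                 /* some vcpu has not been injected yet */

        if ( h->missed_ticks_policy )
        {
            if ( post == 0 )        /* end-of-interrupt time stamp (3.1) */
                h->mc_last_handled = read_64_main_counter();
        }
        else if ( post == 1 && --h->intr_pending_nr != 0 )
            /* Catch up: the whole broadcast went in; post the next tick. */
            hvm_isa_irq_assert_cb(v->domain, h->isa_irq, hpet_assert_cb);
    }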
4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect to ntp. This is accomplished by migrating all of the state in the h->hpet data structure in the usual way. The hp->mc_offset is recalculated on the receiving node so that the guest sees a continuous hpet main counter.

Code has been added to xc_domain_save.c to send a small message after the domain context is sent. The content of the message is a physical tsc timestamp, last_tsc, read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has access to these time stamps. hpet_load uses the timestamps to account for the time spent saving and loading the domain context. With this technique, the only neglected time is the time spent sending a small network message.
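As a rough picture of the counter-continuity fix-up, the restore side might do something like the sketch below. mc_offset, last_tsc_sender, and first_tsc_receiver are named in the text; how the elapsed migration time is converted from those tsc stamps into main-counter ticks is glossed over here (the migration_ticks parameter), and the guest-counter relation and saved-state plumbing are assumptions.

    /* Guest-visible main counter: host/simulated counter plus a per-domain
     * offset, as implied by the mc_offset recalculation described above. */
    static inline uint64_t hpet_read_guest_mc(const struct HPETState *hp)
    {
        return read_64_main_counter() + hp->mc_offset;
    }

    /* Restore side: resume the guest counter from its saved value,
     * advanced by the migration dead time derived (elsewhere) from
     * last_tsc_sender and first_tsc_receiver. */
    static void hpet_load_recalc_offset(struct HPETState *hp,
                                        uint64_t saved_guest_mc,
                                        uint64_t migration_ticks)
    {
        hp->mc_offset = saved_guest_mc + migration_ticks
                        - read_64_main_counter();
    }

With this form, the first post-migration read returns saved_guest_mc + migration_ticks, so the guest counter neither jumps backward nor loses the time the migration itself consumed.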
5. Test Results

Some recent test results are:

5.1 Linux 4u664 and Windows 2k864 load test
Duration: 70 hours. Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux: .0012%  Windows: .009%

5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
Duration: 23 hours. Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux: .033%  Windows: .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter() in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter() itself. It uses no vcpu or domain state, as it is a physical entity in either mode. And it provides a real physical mode for every read, for those applications that desire this.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity. Tests performed to date verify this, and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations on different platforms using the physical hpet device will be investigated. Additional overhead measurements of simulated vs. physical hpet mode will be made.

Footnotes:

1. I don't recall the exact accuracy improvement with end-of-interrupt stamping, but it was significant, perhaps better than a two-to-one improvement. It would be a very simple matter to re-measure the improvement, as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>


Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it's for sure a lot less invasive than this patchset.

-- Keir
Keir,

I think the changes are required. We'll run some tests today so that we have some data to talk about.

-Dave
This seems to break the save/restore format (in at least two places)...

S.
Hi Dave and Ben --

When running tests on xen-unstable (without your patch), please ensure that hpet=1 is set in the hvm config. Also, I think that when hpet is the clocksource on RHEL4-32, the clock IS resilient to missed ticks, so timer_mode should be 2 (vs. when pit is the clocksource on RHEL4-32, all clock ticks must be delivered, and so timer_mode should be 0).

Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my intent to clean this up, but I won't get to it until next week.

Thanks,
Dan
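For concreteness, the two settings Dan refers to would appear in the hvm guest configuration file roughly as follows (an illustrative fragment; the rest of the config is omitted):

    # HVM guest config fragment (illustrative)
    hpet = 1          # expose the virtual hpet to the guest
    timer_mode = 2    # hpet clocksource on RHEL4-32: guest computes missed ticks
    # timer_mode = 0  # use instead when pit is the clocksource: deliver every tick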
Steven Hand wrote:
> This seems to break the save/restore format (in at least two places)...

Steven,

Can you give me more information on this? What sort of failures are you seeing?

thanks,
Dave
Hi Dan,

I am running with hpet=1 and timer_mode=2. I don't see where timer_mode is checked for hpet timekeeping, but I set it nevertheless.

thanks,
Dave
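For concreteness, both settings Dan mentions live in the hvm guest config file. A minimal fragment might look like the sketch below; hpet and timer_mode are the options discussed in this thread, while the surrounding lines are illustrative boilerplate from a typical hvm config, not taken from this thread:

    # hvm guest config fragment (illustrative sketch)
    kernel = "/usr/lib/xen/boot/hvmloader"
    builder = "hvm"
    # expose an hpet to the guest
    hpet = 1
    # 2 = guest computes missed ticks (e.g. RHEL4-32 with hpet clocksource)
    # 0 = all ticks must be delivered (e.g. RHEL4-32 with pit clocksource)
    timer_mode = 2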
Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64-bit guests configured for hpet with xen-unstable as is. As we have discussed many times, the ntp requirement is .05%. Tests on the patch we just submitted for hpet have indicated errors of .0012% on this platform under similar test conditions, and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits. In an overnight test with our hpet patch, the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave

Dan Magenheimer wrote:
> [snip: message quoted in full earlier in the thread]
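As background on how these percentages are formed: drift is the fractional disagreement between the guest clock and an ntp-disciplined reference over the same interval. A minimal C sketch of the arithmetic, with made-up sample values rather than the actual harness behind the numbers above:

    #include <math.h>
    #include <stdio.h>

    /* Sketch of the drift arithmetic behind figures like .1%, .05% and
     * .0012% in this thread.  The two elapsed times would come from the
     * guest clock and an ntp-disciplined reference sampled over the same
     * test interval; the values below are invented for illustration. */
    int main(void)
    {
        double guest_elapsed_s = 252072.0; /* ~70 hours by the guest clock */
        double ref_elapsed_s   = 252000.0; /* same interval by the reference */

        double drift_pct = fabs(guest_elapsed_s - ref_elapsed_s)
                           / ref_elapsed_s * 100.0;

        /* ntp can only discipline a clock whose drift is under .05% */
        printf("drift = %.4f%% (%s the .05%% ntp limit)\n",
               drift_pct, drift_pct < 0.05 ? "within" : "outside");
        return 0;
    }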
Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option for hvm guests (let's call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM, though it's possible there's a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which, if I understand correctly, is directly using Xen system time) is also seeing an error of .1%? Also, for the skew you are seeing (in both hvm guests and domain0), is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may be unavailable on some systems and may cause significant performance issues on others. So I think we will still need to track down the poor accuracy when hwhpet=0. And if for some reason Xen system time can't be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source.

Or maybe there's a computation error somewhere in the hvm hpet scaling code? Hmmm...

Thanks,
Dan

> [snip: Dave's message of Friday, June 06, 2008 1:33 PM, quoted in full earlier in the thread]
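The unit conversion between Dave's percentage and Dan's PPM figure is direct:

    .1% drift = .001 s per s = 1000 us per s = 1000 parts per million (PPM)
    .05% (the ntp limit) = 500 PPM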
On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0. And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time() call the platform time read function and bypass TSC altogether. This would be cleaner than having only the vHPET code punch through to the physical HPET: instead we have the boot-time chosen platform timesource used by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's patch. If this is actually the case, it would be nice to break those out as separate patches, as I think an 11% drift must largely be due to device-model bugs rather than to relatively insignificant differences between hvm_get_guest_time() and the physical HPET main counter.

-- Keir
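A rough C sketch of the extra timer_mode being proposed; read_platform_stime() and the offset field are assumed names used for illustration, not necessarily the actual Xen interfaces of the day:

    #include <stdint.h>

    typedef uint64_t u64;

    /* Assumed interface: nanoseconds since boot from the boot-time chosen
     * platform timesource (hpet, pit, acpi-pm timer, ...). */
    extern u64 read_platform_stime(void);

    /* Assumed per-domain state: an offset maintained so that save/restore
     * and migration present a continuous guest clock. */
    struct hvm_time_sketch {
        u64 stime_offset;
    };

    /* The proposed mode: guest time comes straight from the platform
     * timesource, bypassing the host TSC entirely, so every virtual timer
     * sees the same boot-time chosen clock. */
    u64 hvm_get_guest_time_platform(const struct hvm_time_sketch *t)
    {
        return read_platform_stime() + t->stime_offset;
    }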
> Possibly there are bugs in the hpet device model which are fixed by Dave's
> patch. If this is actually the case, it would be nice to break those out as
> separate patches, as I think an 11% drift must largely be due to
> device-model bugs rather than relatively insignificant differences between
> hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short-circuited the missed ticks policy code in the hpet.c patch, but used the physical hpet for each access. The result for Linux was a drift of .1%, the same as the xen-unstable bits. Conversely, I get very good drift numbers, i.e., under .03%, when using the missed ticks policy code and running in simulated mode (layered on stime) when stime uses hpet. So clearly, the improvement from .1% to .03% is due to the policy code.

I haven't run the short-circuit test with the Windows policy, but I can do that on Monday.

Note: for Windows and Linux I get < .03% drift using the policy code and running in simulated mode, whether stime is using hpet or some other device.

regards,
Dave

> [snip: Keir's message of Fri 6/6/2008 6:34 PM, quoted in full above]
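The policy code in question gates delivery on the hpet main counter. A minimal C sketch of the decision that was short-circuited in this experiment, with illustrative names rather than the patch's actual code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of the missed ticks policy's delivery gate: inject the next
     * clock interrupt only if at least one period of the hpet main counter
     * has elapsed since the guest finished handling the previous one.
     * last_handled is the stamp recorded by the end-of-interrupt (post = 0)
     * notification callback, not the injection-time stamp. */
    bool tick_due(uint64_t main_counter_now,
                  uint64_t last_handled,
                  uint64_t period)
    {
        return (main_counter_now - last_handled) >= period;
    }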
Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM,
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think Xen system time is fine. You have to add the interrupt delivery policies described in the write-up for the patch to get accurate timekeeping in the guest. The Windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which, if I understand correctly, is directly
> using Xen system time) is also seeing an error of .1%? Also, for
> the skew you are seeing (in both hvm guests and domain0), is time
> moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave

> [snip: Dan's message of Fri 6/6/2008 4:29 PM, quoted in full earlier in the thread]
So I think we will still need to track down the poor accuracy when hwhpet=0. And if for some reason Xen system time can''t be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet. One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source. Or maybe there''s a computation error somewhere in the hvm hpet scaling code? Hmmm... Thanks, Dan> -----Original Message----- > From: Dave Winchell [mailto:dwinchell@virtualiron.com] > Sent: Friday, June 06, 2008 1:33 PM > To: dan.magenheimer@oracle.com; Keir Fraser > Cc: Ben Guthro; xen-devel; Dave Winchell > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > Dan, Keir: > > Preliminary tests results indicate an error of .1% for Linux 64 bit > guests configured > for hpet with xen-unstable as is. As we have discussed many times, the > ntp requirement is .05%. > Tests on the patch we just submitted for hpet have indicated errors of > .0012% > on this platform under similar test conditions and .03% on > other platforms. > > Windows vista64 has an error of 11% using hpet with the > xen-unstable bits. > In an overnight test with our hpet patch, the Windows vista > error was .008%. > > The tests are with two or three guests on a physical node, all under > load, and with > the ratio of vcpus to phys cpus > 1. > > I will continue to run tests over the next few days. > > thanks, > Dave > > > Dan Magenheimer wrote: > > > Hi Dave and Ben -- > > > > When running tests on xen-unstable (without your patch), > please ensure > > that hpet=1 is set in the hvm config and also I think that when hpet > > is the clocksource on RHEL4-32, the clock IS resilient to > missed ticks > > so timer_mode should be 2 (vs when pit is the clocksource > on RHEL4-32, > > all clock ticks must be delivered and so timer_mode should be 0). > > > > Per > > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg > 00098.html it''s > > my intent to clean this up, but I won''t get to it until next week. > > > > Thanks, > > Dan > > > > -----Original Message----- > > *From:* xen-devel-bounces@lists.xensource.com > > [mailto:xen-devel-bounces@lists.xensource.com]*On > Behalf Of *Dave > > Winchell > > *Sent:* Friday, June 06, 2008 4:46 AM > > *To:* Keir Fraser; Ben Guthro; xen-devel > > *Cc:* dan.magenheimer@oracle.com; Dave Winchell > > *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > > Keir, > > > > I think the changes are required. We''ll run some tests > today today so > > that we have some data to talk about. > > > > -Dave > > > > > > -----Original Message----- > > From: xen-devel-bounces@lists.xensource.com on behalf > of Keir Fraser > > Sent: Fri 6/6/2008 4:58 AM > > To: Ben Guthro; xen-devel > > Cc: dan.magenheimer@oracle.com > > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > > Are these patches needed now the timers are built on Xen system > > time rather > > than host TSC? Dan has reported much better > time-keeping with his > > patch > > checked in, and it¹s for sure a lot less invasive than > this patchset. > > > > > > -- Keir > > > > On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote: > > > > > > > > 1. 
Introduction > > > > > > This patch improves the hpet based guest clock in > terms of drift and > > > monotonicity. > > > Prior to this work the drift with hpet was greater > than 2%, far > > above the .05% > > > limit > > > for ntp to synchronize. With this code, the drift ranges from > > .001% to .0033% > > > depending > > > on guest and physical platform. > > > > > > Using hpet allows guest operating systems to provide monotonic > > time to their > > > applications. Time sources other than hpet are not > monotonic because > > > of their reliance on tsc, which is not synchronized > across physical > > > processors. > > > > > > Windows 2k864 and many Linux guests are supported with two > > policies, one for > > > guests > > > that handle missed clock interrupts and the other for guests > > that require the > > > correct number of interrupts. > > > > > > Guests may use hpet for the timing source even if the physical > > platform has no > > > visible > > > hpet. Migration is supported between physical machines which > > differ in > > > physical > > > hpet visibility. > > > > > > Most of the changes are in hpet.c. Two general facilities are > > added to track > > > interrupt > > > progress. The ideas here and the facilities would be useful in > > vpt.c, for > > > other time > > > sources, though no attempt is made here to improve vpt.c. > > > > > > The following sections discuss hpet dependencies, interrupt > > delivery policies, > > > live migration, > > > test results, and relation to recent work with monotonic time. > > > > > > > > > 2. Virtual Hpet dependencies > > > > > > The virtual hpet depends on the ability to read the > physical or > > simulated > > > (see discussion below) hpet. For timekeeping, the > virtual hpet > > also depends > > > on two new interrupt notification facilities to implement its > > policies for > > > interrupt delivery. > > > > > > 2.1. Two modes of low-level hpet main counter reads. > > > > > > In this implementation, the virtual hpet reads with > > read_64_main_counter(), > > > exported by > > > time.c, either the real physical hpet main counter register > > directly or a > > > "simulated" > > > hpet main counter. > > > > > > The simulated mode uses a monotonic version of get_s_time() > > (NOW()), where the > > > last > > > time value is returned whenever the current time value is less > > than the last > > > time > > > value. In simulated mode, since it is layered on s_time, the > > underlying > > > hardware > > > can be hpet or some other device. The frequency of the main > > counter in > > > simulated > > > mode is the same as the standard physical hpet frequency, > > allowing live > > > migration > > > between nodes that are configured differently. > > > > > > If the physical platform does not have an hpet > device, or if xen > > is configured > > > not > > > to use the device, then the simulated method is used. If there > > is a physical > > > hpet device, > > > and xen has initialized it, then either simulated or physical > > mode can be > > > used. > > > This is governed by a boot time option, hpet-avoid. > Setting this > > option to 1 > > > gives the > > > simulated mode and 0 the physical mode. The default > is physical > > mode. > > > > > > A disadvantage of the physical mode is that may take longer to > > read the device > > > than in simulated mode. On some platforms the cost is > about the > > same (less > > > than 250 nsec) for > > > physical and simulated modes, while on others physical cost is > > much higher > > > than simulated. 
> > > A disadvantage of the simulated mode is that it can return the > > same value > > > for the counter in consecutive calls. > > > > > > 2.2. Interrupt notification facilities. > > > > > > Two interrupt notification facilities are introduced, one is > > > hvm_isa_irq_assert_cb() > > > and the other hvm_register_intr_en_notif(). > > > > > > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to > > the vioapic. > > > hvm_isa_irq_assert_cb allows a callback to be passed along to > > > vioapic_deliver() > > > and this callback is called with a mask of the vcpus > which will > > get the > > > interrupt. This callback is made before any vcpus receive an > > interrupt. > > > > > > Vhpet uses hvm_register_intr_en_notif() to register a handler > > for a particular > > > vector that will be called when that vector is injected in > > > [vmx,svm]_intr_assist() > > > and also when the guest finishes handling the interrupt. Here > > finished is > > > defined > > > as the point when the guest re-enables interrupts or > lowers the > > tpr value. > > > EOI is not used as the end of interrupt as this is sometimes > > returned before > > > the interrupt handler has done its work. A flag is > passed to the > > handler > > > indicating > > > whether this is the injection point (post = 1) or the > interrupt > > finished (post > > > = 0) point. > > > The need for the finished point callback is discussed in the > > missed ticks > > > policy section. > > > > > > To prevent a possible early trigger of the finished callback, > > intr_en_notif > > > logic > > > has a two stage arm, the first at injection > > (hvm_intr_en_notif_arm()) and the > > > second when > > > interrupts are seen to be disabled > (hvm_intr_en_notif_disarm()). > > Once fully > > > armed, re-enabling > > > interrupts will cause hvm_intr_en_notif_disarm() to > make the end > > of interrupt > > > callback. hvm_intr_en_notif_arm() and > hvm_intr_en_notif_disarm() > > are called by > > > [vmx,svm]_intr_assist(). > > > > > > 3. Interrupt delivery policies > > > > > > The existing hpet interrupt delivery is preserved. > This includes > > > vcpu round robin delivery used by Linux and broadcast delivery > > used by > > > Windows. > > > > > > There are two policies for interrupt delivery, one for Windows > > 2k8-64 and the > > > other > > > for Linux. The Linux policy takes advantage of the > (guest) Linux > > missed tick > > > and offset > > > calculations and does not attempt to deliver the > right number of > > interrupts. > > > The Windows policy delivers the correct number of interrupts, > > even if > > > sometimes much > > > closer to each other than the period. The policies are similar > > to those in > > > vpt.c, though > > > there are some important differences. > > > > > > Policies are selected with an HVMOP_set_param > hypercall with index > > > HVM_PARAM_TIMER_MODE. > > > Two new values are added, > HVM_HPET_guest_computes_missed_ticks and > > > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that > > two new ones > > > are added is that > > > in some guests (32bit Linux) a no-missed policy is needed for > > clock sources > > > other than hpet > > > and a missed ticks policy for hpet. It was felt that > there would > > be less > > > confusion by simply > > > introducing the two hpet policies. > > > > > > 3.1. The missed ticks policy > > > > > > The Linux clock interrupt handler for hpet calculates missed > > ticks and offset > > > using the hpet > > > main counter. 
So the no-missed ticks policy was developed. In the no-missed ticks policy we deliver the correct number of interrupts, even if they are spaced less than a period apart (when catching up).

Windows 2k864 uses a broadcast mode in the interrupt routing such that all vcpus get the clock interrupt. The best Windows drift performance was achieved when the policy code ensured that all the previous interrupts (on the various vcpus) had been injected before injecting the next interrupt to the vioapic.

The policy code works as follows. It uses hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered with hvm_register_intr_en_notif(), at post=1 time it clears the current vcpu in the pending_mask. When the pending_mask is clear it decrements hpet.intr_pending_nr and, if intr_pending_nr is still non-zero, posts another interrupt to the vioapic with hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().

The missed ticks policy intr_en_notif callback also uses the pending_mask method, so even though Linux does not broadcast its interrupts, the code could handle it if it did. In this case the end of interrupt time stamp is made when the pending_mask is clear.
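To make that bookkeeping concrete, a minimal sketch of the two callbacks follows. The names pending_mask, intr_pending_nr, hvm_isa_irq_assert_cb(), hvm_register_intr_en_notif(), and hpet_route_decision_not_missed_ticks() come from the description above; the struct layout, the callback signatures, the omitted locking, and the hpet_assert_irq() helper are assumptions for illustration, not the patch itself.

    #include <stdint.h>

    struct hpet_policy_state {
        uint64_t pending_mask;        /* vcpus that have not yet taken the tick */
        unsigned int intr_pending_nr; /* clock interrupts still owed to the guest */
    };

    /* Hypothetical wrapper around hvm_isa_irq_assert_cb(). */
    void hpet_assert_irq(struct hpet_policy_state *h);

    /* Callback passed down to vioapic_deliver(): runs before any vcpu is
     * interrupted, with the mask of vcpus that will get the interrupt. */
    void hpet_deliver_cb(struct hpet_policy_state *h, uint64_t vcpu_mask)
    {
        h->pending_mask = vcpu_mask;
    }

    /* Callback registered with hvm_register_intr_en_notif();
     * post == 1 at injection, post == 0 when the guest finishes. */
    void hpet_intr_en_cb(struct hpet_policy_state *h, unsigned int vcpu, int post)
    {
        if ( post != 1 )
            return;
        h->pending_mask &= ~(1ULL << vcpu);    /* this vcpu now has its tick */
        if ( h->pending_mask == 0 && h->intr_pending_nr != 0 )
        {
            h->intr_pending_nr--;
            if ( h->intr_pending_nr != 0 )
                hpet_assert_irq(h);            /* deliver a catch-up tick */
        }
    }

hpet_route_decision_not_missed_ticks() would be the producer side of this counter, incrementing intr_pending_nr once per elapsed period.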
4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect to ntp. This is accomplished by migrating all of the state in the h->hpet data structure in the usual way. The hp->mc_offset is recalculated on the receiving node so that the guest sees a continuous hpet main counter (a sketch of this recalculation follows the sign-offs below).

Code has been added to xc_domain_save.c to send a small message after the domain context is sent. The message contains a physical tsc timestamp, last_tsc, read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has access to these time stamps. hpet_load uses the timestamps to account for the time spent saving and loading the domain context. With this technique, the only neglected time is the time spent sending a small network message.

5. Test Results

Some recent test results are:

5.1 Linux 4u664 and Windows 2k864 load test
Duration: 70 hours
Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux .0012%; Windows .009%

5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
Duration: 23 hours
Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux .033%; Windows .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter() in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter() itself. It uses no vcpu or domain state, as it is a physical entity in either mode. And it provides a real physical mode for every read for those applications that desire this.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity. Tests performed to date verify this, and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations on different platforms using the physical hpet device will be investigated. Additional overhead measurements on simulated vs physical hpet mode will be made.

Footnotes:

1. I don't recall the accuracy improvement with end of interrupt stamping, but it was significant, perhaps better than a two-to-one improvement. It would be a very simple matter to re-measure the improvement as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
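To make the section 4 arithmetic concrete, a minimal sketch of the receive-side recalculation follows. The names mc_offset, last_tsc_sender, first_tsc_receiver, and read_64_main_counter() come from the text; the saved_main_counter field, the tsc_to_hpet_ticks() conversion, and the assumption that the two tsc stamps become comparable once scaled are illustrative only.

    #include <stdint.h>

    uint64_t read_64_main_counter(void);            /* from time.c */
    uint64_t tsc_to_hpet_ticks(uint64_t tsc_delta); /* hypothetical conversion */

    struct hpet_migration_state {
        uint64_t saved_main_counter;  /* guest-visible counter at save time */
        uint64_t last_tsc_sender;     /* read just before the message was sent */
        uint64_t first_tsc_receiver;  /* read when the message was received */
        uint64_t mc_offset;           /* guest counter = main counter + mc_offset */
    };

    /* Sketch of the fixup hpet_load would perform on the receiving node. */
    void hpet_load_fixup(struct hpet_migration_state *m)
    {
        /* hpet ticks that elapsed while the context was saved and loaded */
        uint64_t in_flight =
            tsc_to_hpet_ticks(m->first_tsc_receiver - m->last_tsc_sender);

        /* Choose mc_offset so the guest counter continues from its
         * pre-migration value plus the time spent migrating. */
        m->mc_offset = m->saved_main_counter + in_flight
                       - read_64_main_counter();
    }

With this choice, the only unaccounted interval is the network transit of the small last_tsc message, as the text notes.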
RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dave --

Thanks for the additional explanation.

Could you please be very precise, when you say "Linux", as to what you are (and are not) testing? Specifically:
1) kernel version number and/or distro info
2) 32 vs 64
3) kernel parameters specified
4) config file parameters
5) relevant CPU info that may be passed through by Xen to hvm guests (e.g. whether "tsc is synchronized")
6) relevant xen boot parameters (if any)

As we've seen, different combinations of the above can yield very different test results. We'd like to confirm your tests, but if we can avoid unnecessary additional iterations (due to mismatches on the above), that would be helpful.

Our testing goal is to ensure that there is at least one known good combination of parameters for each of RHEL3, RHEL4, and RHEL5 (both 32 and 64) that works on both tsc-synchronized and tsc-unsynchronized Intel and AMD boxes, and hopefully with and without a real physical hpet available.

We don't have a good test environment for Windows time, but if you can provide the same test configuration detail, we may be able to do some testing.

Thanks,
Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM,
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt delivery policies described in the write-up for the patch to get accurate timekeeping in the guest. The Windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also
> I wonder, for the skew you are seeing (in both hvm guests and
> domain0), is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Friday, June 06, 2008 1:33 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit guests configured for hpet with xen-unstable as is. As we have discussed many times, the ntp requirement is .05%. Tests on the patch we just submitted for hpet have indicated errors of .0012% on this platform under similar test conditions and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits. In an overnight test with our hpet patch, the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave
Dan Magenheimer wrote:

> Hi Dave and Ben --
>
> When running tests on xen-unstable (without your patch), please ensure
> that hpet=1 is set in the hvm config, and also I think that when hpet
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks,
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> all clock ticks must be delivered and so timer_mode should be 0).
>
> Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> it's my intent to clean this up, but I won't get to it until next week.
>
> Thanks,
> Dan

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Friday, June 06, 2008 4:46 AM
To: Keir Fraser; Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Keir,

I think the changes are required. We'll run some tests today so that we have some data to talk about.

-Dave

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it's for sure a lot less invasive than this patchset.

-- Keir
RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> A disadvantage of the simulated mode is that it can return the same value
> for the counter in consecutive calls.

It also occurs to me that if the granularity is good enough, an easy fix to this problem might be to always increment the returned value by at least one. Then time is always at least increasing rather than stopped.
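A sketch of that suggestion, layered on the read_64_main_counter() named in the patch write-up; the static variable and the absence of locking are simplifications for illustration, not a proposed implementation.

    #include <stdint.h>

    uint64_t read_64_main_counter(void);  /* simulated-mode read from time.c */

    /* Return a strictly increasing counter: if two reads land in the same
     * time-source granule, bump the second by one instead of repeating it. */
    uint64_t read_64_main_counter_strict(void)
    {
        static uint64_t last;   /* would live in vhpet state, under a lock */
        uint64_t now = read_64_main_counter();

        if ( now <= last )
            now = last + 1;
        last = now;
        return now;
    }

This leans on the granularity caveat above: the bump must be small enough relative to the counter resolution that repeated calls never run ahead of real time.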
Hi Dan,

> Could you please be very precise, when you say "Linux",
> as to what you are (and are not) testing? Specifically:
> 1) kernel version number and/or distro info

I personally have been testing with Linux red hat 4u4, 4u5 and 4u6 64 bit and red hat 4u4 32 bit. I've also tested Windows 2k8sp0 64bit and Vista sp1 64 bit. Our QA group is currently testing a wider set of guests.

> 2) 32 vs 64

Both.

> 3) kernel parameters specified

I'm pretty sloppy here. Frequently I have clock=pit because our build process defaults to that, and I know that the guests I use ignore clock= when hpet is in the acpi table. However, I don't recommend that others do this. I check that Linux is using hpet in its boot log, and I have statistics in the patch which tell me that hpet is being used. You've done a lot of investigation on clock= and clocksource, so I would like to hear your recommendation.

> 4) config file parameters

Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks for Windows 2k8-64 and Vista 64. 8 vcpus for Linux and 2 for Windows.

> 5) relevant CPU info that may be passed through by Xen
> to hvm guests (e.g. whether "tsc is synchronized")

Whatever xen-unstable does. I have not changed it.

> 6) relevant xen boot parameters (if any)

Nothing relevant.

> As we've seen, different combinations of the above can yield
> very different test results. We'd like to confirm your tests,
> but if we can avoid unnecessary additional iterations (due to
> mismatches on the above), that would be helpful.
>
> Our testing goal is to ensure that there is at least one
> known good combination of parameters for each of RHEL3,
> RHEL4, and RHEL5 (both 32 and 64) that works on
> both tsc-synchronized and tsc-unsynchronized Intel
> and AMD boxes. And hopefully that works with and without
> a real physical hpet available.

Thanks very much for testing this.

> We don't have a good test environment for Windows time,
> but if you can provide the same test configuration detail,
> we may be able to do some testing.

The configuration information was given above.

Thanks,
Dave
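For reference, the timer_mode settings above map to the new HVM_PARAM_TIMER_MODE values selected via the HVMOP_set_param hypercall described in the write-up. A toolstack-side sketch follows; the parameter index and the two policy names come from the patch, while the libxc call signature, the header locations, and the guest-type test are assumptions for illustration.

    #include <xenctrl.h>          /* xc_set_hvm_param(), assumed available */
    #include <xen/hvm/params.h>   /* HVM_PARAM_TIMER_MODE; the two HVM_HPET_*
                                     constants would be added by this patch */

    /* Select the vhpet interrupt delivery policy for a guest. */
    int set_hpet_timer_mode(int xc_handle, domid_t domid, int windows_guest)
    {
        unsigned long mode = windows_guest
            ? HVM_HPET_guest_does_not_compute_missed_ticks /* 2k8-64, Vista 64 */
            : HVM_HPET_guest_computes_missed_ticks;        /* Linux guests */

        return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE, mode);
    }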
Thanks, Dan -----Original Message----- From: Dave Winchell [mailto:dwinchell@virtualiron.com] Sent: Sunday, June 08, 2008 2:32 PM To: dan.magenheimer@oracle.com; Keir Fraser Cc: Ben Guthro; xen-devel; Dave Winchell Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy Hi Dan, > While I am fully supportive of offering hardware hpet as an option > for hvm guests (let''s call it hwhpet=1 for shorthand), I am very > surprised by your preliminary results; the most obvious conclusion > is that Xen system time is losing time at the rate of 1000 PPM > though its possible there''s a bug somewhere else in the "time > stack". Your Windows result is jaw-dropping and inexplicable, > though I have to admit ignorance of how Windows manages time. I think xen system time is fine. You have to add the interrupt delivery policies decribed in the write-up for the patch to get accurate timekeeping in the guest. The windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%. > I think with my recent patch and hpet=1 (essentially the same as > your emulated hpet), hvm guest time should track Xen system time. > I wonder if domain0 (which if I understand correctly is directly > using Xen system time) is also seeing an error of .1%? Also > I wonder for the skew you are seeing (in both hvm guests and > domain0) is time moving too fast or two slow? I don''t recall the direction. I can look it up in my notes at work tomorrow. > Although hwhpet=1 is a fine alternative in many cases, it may > be unavailable on some systems and may cause significant performance > issues on others. So I think we will still need to track down > the poor accuracy when hwhpet=0. Our patch is accurate to < .03% using the physical hpet mode or the simulated mode. > And if for some reason > Xen system time can''t be made accurate enough (< 0.05%), then > I think we should consider building Xen system time itself on > top of hardware hpet instead of TSC... at least when Xen discovers > a capable hpet. In our experience, Xen system time is accurate enough now. > One more thought... do you know the accuracy of the TSC crystals > on your test systems? I posted a patch awhile ago that was > intended to test that, though I guess it was only testing skew > of different TSCs on the same system, not TSCs against an > external time source. I do not know the tsc accuracy. > Or maybe there''s a computation error somewhere in the hvm hpet > scaling code? Hmmm... Regards, Dave -----Original Message----- From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] Sent: Fri 6/6/2008 4:29 PM To: Dave Winchell; Keir Fraser Cc: Ben Guthro; xen-devel Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy Dave -- Thanks much for posting the preliminary results! While I am fully supportive of offering hardware hpet as an option for hvm guests (let''s call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM though its possible there''s a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time. I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which if I understand correctly is directly using Xen system time) is also seeing an error of .1%? 
Also I wonder for the skew you are seeing (in both hvm guests and domain0)
is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may be
unavailable on some systems and may cause significant performance issues on
others. So I think we will still need to track down the poor accuracy when
hwhpet=0. And if for some reason Xen system time can't be made accurate
enough (< 0.05%), then I think we should consider building Xen system time
itself on top of hardware hpet instead of TSC... at least when Xen
discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your
test systems? I posted a patch awhile ago that was intended to test that,
though I guess it was only testing skew of different TSCs on the same
system, not TSCs against an external time source.

Or maybe there's a computation error somewhere in the hvm hpet scaling
code? Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%. Tests on the patch
> we just submitted for hpet have indicated errors of .0012% on this
> platform under similar test conditions and .03% on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the xen-unstable
> bits. In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config and also I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient to
> > missed ticks so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so that
> > we have some data to talk about.
> >
> > -Dave
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather than host TSC?
> > Dan has reported much better time-keeping with his patch checked in,
> > and it's for sure a lot less invasive than this patchset.
> >
> > -- Keir
> >
> > On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> > > [patch description snipped]
On 9/6/08 00:26, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > 4) config file parameters
>
> Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
> for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
> for Windows 2k8-64 and Vista 64.
> 8 vcpus for Linux and 2 for Windows.

These new HVM_HPET options seem a weird design choice. It appears that you
can only set these or one of the old options, so there's not actually any
independence between the mode used by vpt.c and the mode used by hpet.c. At
guest install time you ought to be able to tell whether the guest will use
hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide
whether missed-ticks accounting is required or not.

I'd be more agreeable to a patch that stripped out the physical hpet
accesses (since you say they are not the reason for the improvement in
accuracy), built hpet on top of vpt, and added the necessary extra
mechanisms to deal with interrupt broadcasts into vpt.c. And which was
split into more separate pieces of mechanism.

-- Keir
> These new HVM_HPET options seem a weird design choice. It appears that
> you can only set these or one of the old options, so there's not actually
> any independence between the mode used by vpt.c and the mode used by
> hpet.c. At guest install time you ought to be able to tell whether the
> guest will use hpet or not based on its version (RHELx, SLESy, Winz etc
> etc) and decide whether missed-ticks accounting is required or not.

I'll use the existing options instead.

> I'd be more agreeable to a patch that stripped out the physical hpet
> accesses (since you say they are not the reason for the improvement in
> accuracy), built hpet on top of vpt, and added the necessary extra
> mechanisms to deal with interrupt broadcasts into vpt.c. And which was
> split into more separate pieces of mechanism.

OK, I'll work on this. How much time do I have to make the release you are
working on?

thanks,
Dave
On 9/6/08 12:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > I'd be more agreeable to a patch that stripped out the physical hpet
> > accesses (since you say they are not the reason for the improvement in
> > accuracy), built hpet on top of vpt, and added the necessary extra
> > mechanisms to deal with interrupt broadcasts into vpt.c. And which was
> > split into more separate pieces of mechanism.
>
> OK, I'll work on this. How much time do I have to make the release you
> are working on?

I'm thinking feature freeze at the end of the month.

-- Keir
On 9/6/08 13:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> > OK, I'll work on this. How much time do I have to make the release you
> > are working on?
>
> I'm thinking feature freeze at the end of the month.

Oh, while I'm commenting on the current version of the patch, I will point
out that changes to the save/restore format need careful thought. At the
very least we'd like to be backward compatible (old images restore on new
Xen) even if we don't achieve compatibility the other way round. I didn't
really look too closely at this aspect of the patches, so perhaps these
patches are fine in this regard.

Either way I think the core re-architecting of the hpet device model to
handle missed ticks and so on is independent of interface changes anyway
(apart from reasonable extensions to the timer_mode option). Changes to
save/restore format, addition of extra debugging code, and peripheral
things like that it'd be nice to have in separate patches. It makes the
core stuff easier to review and more likely to get accepted without quibble
(because there's less of it, and it is all dedicated to a single purpose
which I think we all agree is where we want to be).

Thanks,
Keir
Keir Fraser wrote:

> Oh, while I'm commenting on the current version of the patch, I will
> point out that changes to the save/restore format need careful thought.
> At the very least we'd like to be backward compatible (old images restore
> on new Xen) even if we don't achieve compatibility the other way round.
> I didn't really look too closely at this aspect of the patches, so
> perhaps these patches are fine in this regard.

I'll look into the live migrate from old to new.

> Either way I think the core re-architecting of the hpet device model
> to handle missed ticks and so on is independent of interface changes
> anyway (apart from reasonable extensions to the timer_mode option).
> Changes to save/restore format, addition of extra debugging code, and
> peripheral things like that it'd be nice to have in separate patches.
> It makes the core stuff easier to review and more likely to get
> accepted without quibble (because there's less of it, and it is all
> dedicated to a single purpose which I think we all agree is where we
> want to be).

This is fine. I'll break up the patch along the lines you suggest.

Thanks,
Dave
> At guest install time you ought to be able to tell whether the guest
> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> and decide whether missed-ticks accounting is required or not.

Unfortunately this is not true on Linux, at least without gathering (and
hardcoding) more information about the system. Whether hpet is used or not
is dependent not only on the OS/version and hvm config parameters, but also
on kernel command line parameters and even the underlying CPU. For example,
on RHEL5u1, if the tsc is synchronized and the CPU is Intel, and no kernel
parameters are chosen, tsc will be chosen as the default clocksource even
if hpet is present. Ugly.

That said, if Dave's patch achieves the stated accuracy on most versions of
Linux (e.g. at least RHEL4+5, 32+64, smp+1p) for SOME set of parameters
(which might be different on each Linux version), it would still be better
than what we have now.

The ideal solution, I think, would be for the default hvm settings to "do
the right thing" for both Windows and Linux at least for the vast majority
of configuration choices. I'm not sure this is possible, but it sure would
be nice.

Dan
I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
measurements. Is that just the accuracy of the underlying tsc on your test
system, e.g. the skew of tsc relative to an external (ntp) source? Or is
Xen (tsc-based) system time skewing that much on an overcommitted system
(and skewing much less than 0.03% on an unloaded system)?

Running the following on dom0 both on an unloaded and overcommitted system
(with ntpd off in dom0 and all guests) might be interesting:

# ntpdate $NTPSERVER; sleep 3600; ntpdate -q $NTPSERVER

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Saturday, June 07, 2008 3:21 PM
To: Keir Fraser; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> Possibly there are bugs in the hpet device model which are fixed by
> Dave's patch. If this is actually the case, it would be nice to break
> those out as separate patches, as I think an 11% drift must largely be
> due to device-model bugs rather than relatively insignificant differences
> between hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short circuited the missed ticks
policy code in the hpet.c patch, but used the physical hpet on each access.
The result for Linux was a drift of .1%, same as the xen-unstable bits.

Conversely, I get very good drift numbers, i.e., under .03%, when using the
missed ticks policy code and running in simulated mode (layered on stime)
when stime uses hpet.

So clearly, the improvement from .1% to .03% is due to the policy code. I
haven't run the short circuit test with the windows policy but I can do
that on Monday.

Note: For Windows and Linux I get < .03% drift using the policy code and
running simulated mode whether stime is using hpet or some other device.

regards,
Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Fri 6/6/2008 6:34 PM
To: dan.magenheimer@oracle.com; Dave Winchell
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0. And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time()
call to the platform time read function, and bypass TSC altogether. This
would be cleaner than having only the vHPET code punch through to the
physical HPET: instead we have the boot-time chosen platform timesource
used by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's
patch. If this is actually the case, it would be nice to break those out as
separate patches, as I think an 11% drift must largely be due to
device-model bugs rather than relatively insignificant differences between
hvm_get_guest_time() and physical HPET main counter.

-- Keir
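As a concrete illustration of what the ntpdate pair Dan suggests above
measures (a hypothetical helper, not part of the patch): the first ntpdate
zeroes dom0's offset against the server, and the second reports how far the
clock has drifted after an hour, so drift as a percentage is simply offset
over elapsed time:

#include <stdio.h>

/* Hypothetical helper: convert the offset reported by the second
 * ntpdate -q (seconds) and the sleep interval (seconds) into a drift
 * percentage comparable to the figures quoted in this thread. */
static double drift_percent(double offset_sec, double elapsed_sec)
{
    return offset_sec / elapsed_sec * 100.0;
}

int main(void)
{
    /* e.g. an offset of -1.2s after 3600s is about -.033% drift,
     * right at the edge of what ntp can discipline (.05%). */
    printf("%.4f%%\n", drift_percent(-1.2, 3600.0));
    return 0;
}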
On 9/6/08 21:48, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> > At guest install time you ought to be able to tell whether the guest
> > will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> > and decide whether missed-ticks accounting is required or not.
>
> Unfortunately this is not true on Linux, at least without gathering
> (and hardcoding) more information about the system. Whether hpet is
> used or not is dependent not only on the OS/version and hvm config
> parameters, but also on kernel command line parameters and even
> the underlying CPU. For example, on RHEL5u1, if the tsc is synchronized
> and the CPU is Intel, and no kernel parameters are chosen, tsc will be
> chosen as the default clocksource even if hpet is present. Ugly.

It's not immediately obvious that adding further independent configuration
knobs to twiddle would make our lives that much easier. However it
certainly increases the test matrix.

In your example above, by synchronised TSC do you mean constant-rate TSC?
That can at least be hidden in CPUID now.

-- Keir
Dan Magenheimer wrote:

> I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
> measurements. Is that just the accuracy of the underlying tsc on your
> test system, e.g. the skew of tsc relative to an external (ntp) source?
> Or is Xen (tsc-based) system time skewing that much on an overcommitted
> system (and skewing much less than 0.03% on an unloaded system)?

.03% is simply the maximum error we've seen with hpet. The maximum value
(.03) is the same whether it's simulated or physical. The best value
physical is .001%, and I don't remember the best value simulated, but I
believe it is under .01%, perhaps well under. I'll have to repeat that
measurement.

I would think that simulated and physical would give roughly the same drift
values, but perhaps at very low drifts that doesn't hold. I think the .03%
is mostly due to the stability of the physical hpet device on a platform.
I've noticed on some platforms the simulated hpet time actually improves if
you disable the hpet in the bios so that stime() is layered on the pm timer
or whatever. I would like to get to the bottom of this hpet stability
variance from platform to platform.

Regards,
Dave
> >> At guest install time you ought to be able to tell whether the guest
> >> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> >> and decide whether missed-ticks accounting is required or not.
> >
> > Unfortunately this is not true on Linux, at least without gathering
> > (and hardcoding) more information about the system. Whether hpet is
> > used or not is dependent not only on the OS/version and hvm config
> > parameters, but also on kernel command line parameters and even
> > the underlying CPU. For example, on RHEL5u1, if the tsc is
> > synchronized and the CPU is Intel, and no kernel parameters are
> > chosen, tsc will be chosen as the default clocksource even if hpet is
> > present. Ugly.
>
> It's not immediately obvious that adding further independent
> configuration knobs to twiddle would make our lives that much easier.
> However it certainly increases the test matrix.

I fully agree. That's why I think the default parameters in Xen should "do
the right thing". The default will get the most testing, and if users say
"time hurts when I change the parameters" we can say "then don't change the
parameters" ;-)

> In your example above, by synchronised TSC do you mean constant-rate
> TSC? That can at least be hidden in CPUID now.

I meant synchronized as defined in 2.6.18/arch/x86_64/kernel/time.c in the
function unsynchronized_tsc() and as used in the same file in
time_init_gtod(). To make this more complicated, these routines have had
not-insignificant bug fixes in RHEL5u1/2.

But yes, it might be a good idea if X86_FEATURE_CONSTANT_TSC always returns
0 (or at least is configurable and defaults off), since the test may only
be made in the guest at boot time and the guest may migrate to a machine
without the feature. More ugliness, I know. My head hurts.

Dan
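A minimal sketch of the kind of CPUID filtering Keir alludes to. On AMD
CPUs the invariant/constant-rate TSC indication is the "TscInvariant" bit
(CPUID leaf 0x80000007, EDX bit 8); 2.6.18-era Intel kernels instead infer
X86_FEATURE_CONSTANT_TSC synthetically from family/model, so masking CPUID
alone would not cover them. The function name, the policy flag, and where
this hook would be plumbed in are assumptions, not part of the patch under
discussion:

#include <stdint.h>

#define TSC_INVARIANT_LEAF  0x80000007u
#define TSC_INVARIANT_BIT   (1u << 8)

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Hypothetical hook applied to CPUID results returned to an HVM guest:
 * clear the invariant-TSC bit so the guest never concludes that the
 * TSC rate is constant, even after migration to a different host. */
void filter_guest_cpuid(uint32_t leaf, struct cpuid_regs *regs,
                        int hide_constant_tsc)
{
    if (leaf == TSC_INVARIANT_LEAF && hide_constant_tsc)
        regs->edx &= ~TSC_INVARIANT_BIT;  /* guest sees non-constant TSC */
}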
> The Linux policy is more subtle, but is required to go
> from .1% to .03%.

Thanks for the good documentation, which I hadn't thoroughly read until
now. I now understand that the essence of your hpet missed ticks policy is
to ensure that ticks are never delivered too close together.

But I'm trying to understand WHY your patch works, in other words, what
problem it is countering. I care about this for more reasons than just
because it is interesting: (1) I'd like to feel confident that it is fixing
a bug rather than just a symptom of a bug; and (2) I wonder how universally
it is applicable.

I see from code examination in mark_offset_hpet() in
RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that the correction for lost
ticks is just plain wrong in a virtual environment. (Suppose for example
that a virtual tick was delivered every 1.999*hpet_tick... I think the
clock would be off by 50%!) Is this the bug that is being "countered" by
your policy?

However, the lost tick handling in RHEL5u1/kernel/timer.c (which I think is
used also for hpet) is much better, so I am eager to find out if your
policy works there too. If the hpet missed tick policy works for both,
though, I should be happy, though I wonder about upstream kernels (e.g. the
trend toward tickless). That said, I'd rather see this get into Xen 3.3 and
worry about upstream kernels later :-)
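A sketch of the arithmetic behind Dan's 1.999*hpet_tick example (a
simplified model of a truncating lost-tick correction, not the actual RHEL4
code): if the guest counts elapsed whole periods with integer division and
resets its reference stamp each interrupt, every interrupt arriving just
under two periods late is credited as a single tick, and nearly half of
real time is dropped:

#include <stdio.h>

/* Simplified model of a lost-tick correction that truncates: the
 * handler computes how many whole hpet periods elapsed since the last
 * interrupt and advances jiffies by that count, discarding the
 * fractional remainder. */
int main(void)
{
    const unsigned long hpet_tick = 14318; /* counter units per tick (illustrative) */
    unsigned long counter = 0, last = 0, jiffies = 0;
    double spacing = 1.999 * hpet_tick;    /* interrupts ~2 periods apart */

    for (int irq = 0; irq < 1000; irq++) {
        counter += (unsigned long)spacing;
        jiffies += (counter - last) / hpet_tick; /* truncates 1.999 -> 1 */
        last = counter;                          /* remainder is lost */
    }

    /* ~1000 ticks counted vs ~1999 ticks of real time: clock ~50% slow. */
    printf("jiffies=%lu, real ticks=%.0f\n",
           jiffies, 1000 * spacing / hpet_tick);
    return 0;
}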
> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.

> Thanks for the good documentation which I hadn't thoroughly
> read until now. I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together. But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem. On the other
hand, if the interrupt is late it in effect declares a tick plus
fraction. If it just declared the fraction in the first place,
we could deliver the interrupts whenever we wanted.

It's really not that different from the missed ticks policy in
vpt.c, except that the period in vpt.c is based on start of
interrupt, and I have improved on that with end of interrupt as
described in the patch note. I don't recall what prompted me to
try end-of-interrupt, but I saw a significant improvement. I may
have been running a monotonicity test at the same time, which
would explain the lock contention mentioned in the write-up.

> I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well on my small set of guests. You and our QA are
going to tell us about the wider set. It doesn't matter whether
guest A handles closely spaced interrupts, just whether it
handles them far apart. So it should be pretty universal with
guests that really handle missed ticks. I think it's interesting
that some 32-bit Linux guests handle missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!) Is this the bug that
> is being "countered" by your policy?

I haven't looked at that code; perhaps. I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better,
> so I am eager to find out if your policy works there too.
> If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave
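To make the failure mode Dave describes concrete, here is a
minimal sketch of a guest clock handler that credits at least one
full tick per interrupt. The names and the constant are
illustrative assumptions, not code from any actual kernel:

    #include <stdint.h>

    #define HPET_TICK 1000000ULL  /* main-counter units per guest tick (assumed) */

    static uint64_t last_counter; /* main counter at the previous interrupt */
    static uint64_t jiffies;      /* the guest's tick count */

    void guest_clock_interrupt(uint64_t counter)
    {
        uint64_t ticks = (counter - last_counter) / HPET_TICK;

        /*
         * A late interrupt credits the missed whole ticks, but an
         * early one (less than a period since the last) still credits
         * a full tick, so guest time runs fast.  The missed ticks
         * policy counters this by never delivering two interrupts
         * less than a period apart.
         */
        if (ticks == 0)
            ticks = 1;

        jiffies += ticks;
        last_counter = counter;
    }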
> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from the
EL4u5-32 code. If this is true, one would expect the default "no
missed tick" policy to see time moving faster than an external
source -- the first missed tick delivered after a long sleep
would "catch up" and then the remainder would each add another
tick.

> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the
> first place, we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded
and a new tick period commences at "now", so the fractions
eventually accumulate as lost time. In EL5u1-32, however, it
looks like the fractions are accounted for. Indeed the EL5u1-32
"lost tick handling" code resembles the Linux/ia64 code, which is
what I've always assumed was the "missed tick" model. In this
case, I think no policy is necessary and the measured skew should
be identical to any physical hpet skew. I'll have to test this
hypothesis though.
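A sketch of the two accounting styles Dan contrasts, with
illustrative names and an assumed tick constant (this is not the
RHEL source):

    #include <stdint.h>

    #define HPET_TICK 1000000ULL  /* main-counter units per guest tick (assumed) */

    static uint64_t last_counter;
    static uint64_t jiffies;

    /* EL4u5-style, as Dan reads it: the tick period restarts at
     * "now", so the fraction past the last whole tick is discarded
     * and slowly accumulates as lost time. */
    void tick_discarding_fraction(uint64_t counter)
    {
        jiffies += (counter - last_counter) / HPET_TICK;
        last_counter = counter;            /* remainder thrown away */
    }

    /* EL5u1/ia64-style: only whole ticks are consumed and the
     * remainder carries over, so no time is lost however the
     * interrupts arrive. */
    void tick_keeping_fraction(uint64_t counter)
    {
        uint64_t ticks = (counter - last_counter) / HPET_TICK;
        jiffies += ticks;
        last_counter += ticks * HPET_TICK; /* remainder carries over */
    }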
On 10/6/08 00:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.

Doesn't this policy guarantee that you actually deliver interrupts at
consistently too low a rate? Since the delivery period is now timer
period + latency of interrupt handling? I suppose it works for this
guest type because it doesn't actually care about getting interrupts
at the correct rate, so long as the ticks are always a bit late? For
those that do need missed ticks to be delivered, do you track missed
ticks at the absolute correct rate?

This is perhaps a fine tradeoff for all platform timers -- those
guests that can handle missed ticks obviously do not care about
getting their timer interrupts at absolutely the correct rate, and
delivering a little late is what they are geared to handle (getting
delivered consistently early is just weird!). Whereas guests that
need all ticks also want them (at least over the long run) at exactly
the correct rate.

I think there's good empirical analysis in the work you've done. We
just need the patches cleaned up and generalised for vpt.c now.

 -- Keir
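One way to read the arithmetic behind Keir's question, with assumed
numbers rather than measurements:

    #include <stdio.h>

    int main(void)
    {
        double period_ms  = 10.0; /* nominal tick period (assumed) */
        double latency_ms = 0.5;  /* guest handling latency (assumed) */

        /* If each tick is gated on the previous end-of-interrupt,
         * the delivery period stretches to period + latency, so the
         * achieved rate falls below nominal unless the guest computes
         * missed ticks from the main counter itself. */
        printf("achieved rate: %.1f%% of nominal\n",
               100.0 * period_ms / (period_ms + latency_ms));
        return 0;
    }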
> > I don't recall what prompted me to try end-of-interrupt,
> > but I saw a significant improvement. I may have been running
> > a monotonicity test at the same time to explain the lock
> > contention mentioned in the write-up.

> Doesn't this policy guarantee that you actually deliver interrupts
> at consistently too low a rate? Since the delivery period is now
> timer period + latency of interrupt handling?

Yes, the rate ends up being about half the normal rate because I toss
the interrupt if it doesn't meet the requirement. If, in testing, we
find a guest that has a problem with half the normal rate, we can
fine tune the policy. For example, instead of discarding, set a small
timer.

> I suppose it works for this guest type because it doesn't actually
> care about getting interrupts at the correct rate, so long as the
> ticks are always a bit late?

True.

> For those that do need missed ticks to be delivered, do you track
> missed ticks at the absolute correct rate?

Yes.

> This is perhaps a fine tradeoff for all platform timers -- those
> guests that can handle missed ticks obviously do not care about
> getting their timer interrupts at absolutely the correct rate, and
> delivering a little late is what they are geared to handle (getting
> delivered consistently early is just weird!). Whereas guests that
> need all ticks also want them (at least over the long run) at
> exactly the correct rate.

> I think there's good empirical analysis in the work you've done. We
> just need the patches cleaned up and generalised for vpt.c now.

Thanks. I'll get to work on the generalization, etc.

-Dave
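A sketch of the gate Dave describes and why it tosses roughly every
other tick. hvm_register_intr_en_notif() and read_64_main_counter()
are named in the patch note; everything else here is a stand-in:

    #include <stdint.h>

    #define PERIOD 1000000ULL     /* tick period in main-counter units (assumed) */

    static uint64_t last_handled; /* main counter at end-of-interrupt */

    extern uint64_t read_main_counter(void); /* stand-in for read_64_main_counter() */
    extern void deliver_to_vioapic(void);    /* stand-in for the delivery path */

    /* Called at each periodic timer expiry. */
    void tick_timer_expired(void)
    {
        uint64_t now = read_main_counter();

        if (now - last_handled < PERIOD)
            return;               /* too soon after the last handled tick: toss */
        deliver_to_vioapic();     /* the guest computes missed ticks itself */
    }

    /* Registered via hvm_register_intr_en_notif(); runs when the
     * guest re-enables interrupts (post = 0). */
    void intr_finished_callback(void)
    {
        /* last_handled lags delivery by the handling latency, so the
         * very next expiry lands inside the window and is tossed --
         * hence roughly every other tick, about half the nominal
         * rate. */
        last_handled = read_main_counter();
    }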
> > Doesn't this policy guarantee that you actually deliver interrupts at a
> > consistently too-low rate? Since the delivery period is now timer period +
> > latency of interrupt handling?
>
> Yes, the rate ends up being about half the normal rate because
> I toss the interrupt if it doesn't meet the requirement. If, in testing,
> we find a guest that has a problem with half the normal rate, we can
> fine tune the policy. For example, instead of discarding, set a small timer.

A better solution is to set the new period timer at end-of-interrupt time
(in the callback) instead of delivery time (timer expiration time).
This way the rate will be very close to what the guest expects.
I think I'll make this change.

-Dave
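A sketch of that re-arm, under the same caveat that the names and
signatures are illustrative:

#include <stdint.h>

/* Hypothetical timer and clock primitives. */
extern void set_timer_abs(uint64_t expires_ns); /* one-shot, absolute */
extern uint64_t now_ns(void);

static uint64_t period_ns;

/* Called from the intr_en_notif callback.  post == 1 is the injection
 * point; post == 0 means the guest has finished handling the interrupt.
 * Arming the next tick here spaces deliveries one period after each
 * *handled* tick, so ticks no longer bunch up against the
 * minimum-spacing test and get tossed. */
static void hpet_tick_notif(int post)
{
    if ( post == 0 )
        set_timer_abs(now_ns() + period_ns);
}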
On 10/6/08 13:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > Yes, the rate ends up being about half the normal rate because
> > I toss the interrupt if it doesn't meet the requirement. If, in testing,
> > we find a guest that has a problem with half the normal rate, we can
> > fine tune the policy. For example, instead of discarding, set a small timer.
>
> A better solution is to set the new period timer at end-of-interrupt time
> (in the callback) instead of delivery time (timer expiration time).
> This way the rate will be very close to what the guest expects.
> I think I'll make this change.

Yes, that's what I thought you'd done. It sounds nicer to me.

 -- Keir
Keir, Dan:

Although I plan to break up the patch, etc., I'm posting this fix to the
patch for anyone who might be interested.

thanks,
Dave
> In EL5u1-32 however it looks like the fractions are accounted
> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code which is what I've always assumed was
> the "missed tick" model. In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew. I'll have to test this hypothesis though.

I've tested this hypothesis and it seems to hold true.
This means the existing (unpatched) hpet code works fine
on EL5-32bit (vcpus=1) when hpet is the clocksource,
even when the machine is overcommitted. A second hypothesis
still needs to be tested: that Dave's patch will not make this worse.

(Note that per previous discussion, my EL5u1-32bit guest
running on an Intel dual-core physical box chose tsc as
the best clocksource and I had to override it with
clock=hpet in the kernel command line.)

> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code. If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.

Indeed with the existing (unpatched) hpet code, time is
running faster on EL4u5-32 (vcpus=1, when overcommitted).
So Dave's patch is definitely needed here.

Will try 64-bit next.

Dan

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Monday, June 09, 2008 9:21 PM
To: 'Dave Winchell'; 'Keir Fraser'
Cc: 'xen-devel'; 'Ben Guthro'
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from
the EL4u5-32 code. If this is true, one would expect the
default "no missed tick" policy to see time moving faster
than an external source -- the first missed tick delivered
after a long sleep would "catch up" and then the remainder
would each add another tick.

> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the first place,
> we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded
and a new tick period commences at "now", so the fractions
eventually accumulate as lost time.

In EL5u1-32 however it looks like the fractions are accounted
for. Indeed the EL5u1-32 "lost tick handling" code resembles
the Linux/ia64 code which is what I've always assumed was
the "missed tick" model. In this case, I think no policy
is necessary and the measured skew should be identical to
any physical hpet skew. I'll have to test this hypothesis though.
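To pin down the two accounting styles contrasted above, here is a toy
model (not the actual guest code) of how each style credits time per
clock interrupt, where delta is the hpet main-counter advance since the
previous interrupt:

#include <stdint.h>

static uint64_t jiffies_a, jiffies_b, leftover_b;

/* EL4u5-style, as described above: every interrupt credits at least one
 * full tick and the sub-tick remainder is discarded.  Early interrupts
 * therefore inflate time, and late interrupts leak the fractions. */
static void account_discard_fraction(uint64_t delta, uint64_t tick)
{
    uint64_t n = delta / tick;
    jiffies_a += (n == 0) ? 1 : n;
}

/* EL5u1/ia64-style: carry the fraction forward, so arrival jitter
 * cancels out over the long run and no delivery policy is needed. */
static void account_carry_fraction(uint64_t delta, uint64_t tick)
{
    leftover_b += delta;
    jiffies_b  += leftover_b / tick;
    leftover_b %= tick;
}

Worked through the first model, a tick arriving every 1.999*hpet_tick
credits one jiffy per 1.999 real ticks, so the clock runs at roughly half
speed -- the mark_offset_hpet failure mode discussed in the appended
exchange below.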
-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Monday, June 09, 2008 5:35 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Dave Winchell; xen-devel; Ben Guthro
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.
>
> Thanks for the good documentation which I hadn't thoroughly
> read until now. I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together. But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem.
On the other hand, if the interrupt is late it in effect declares
a tick plus fraction. If it just declared the fraction in the
first place, we could deliver the interrupts whenever we wanted.

It's really not that different than the missed ticks policy in vpt.c,
except that there the period is based on start of interrupt,
and I have improved that with end-of-interrupt as described
in the patch note.

I don't recall what prompted me to try end-of-interrupt,
but I saw a significant improvement. I may have been running
a monotonicity test at the same time to explain the lock
contention mentioned in the write-up.

> I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well with my small set of guests. You and our
QA are going to tell us about the wider set. It doesn't
matter if guest A handles interrupts closely spaced or not,
just whether it handles them far apart. So it should be pretty
universal with guests that really handle missed ticks.
I think it's interesting that some 32bit Linux guests handle
missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!) Is this the bug that
> is being "countered" by your policy?

I haven't looked at that code; perhaps.
I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better
> so I am eager to find out if your policy works there
> too. If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave
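Since the start-of-interrupt vs end-of-interrupt distinction is the crux
of the improvement, a compressed sketch of the stamping side may help
(illustrative signature; the real hook is the intr_en_notif callback
described in the patch note):

#include <stdint.h>

static uint64_t last_handled_mc; /* the stamp consulted at delivery time */

/* Registered for the hpet vector via hvm_register_intr_en_notif().
 * post == 1: the vector was just injected; post == 0: the guest has
 * re-enabled interrupts, i.e. the handler is done.  Stamping at
 * post == 0 means time the handler spends waiting on the guest's clock
 * spinlock cannot squeeze two main-counter reads (taken under that
 * lock) into less than one period apart. */
static void hpet_intr_en_notif(int post, uint64_t now_mc)
{
    if ( post == 0 )
        last_handled_mc = now_mc;
}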
-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though its possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also
> I wonder, for the skew you are seeing (in both hvm guests and
> domain0), is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave
-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Friday, June 06, 2008 1:33 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit
guests configured for hpet with xen-unstable as is. As we have
discussed many times, the ntp requirement is .05%.
Tests on the patch we just submitted for hpet have indicated errors
of .0012% on this platform under similar test conditions, and .03% on
other platforms.

Windows vista64 has an error of 11% using hpet with the
xen-unstable bits. In an overnight test with our hpet patch,
the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under
load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave
Dan Magenheimer wrote:

> Hi Dave and Ben --
>
> When running tests on xen-unstable (without your patch), please ensure
> that hpet=1 is set in the hvm config. Also, I think that when hpet
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks,
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> all clock ticks must be delivered and so timer_mode should be 0).
>
> Per
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> it's my intent to clean this up, but I won't get to it until next week.
>
> Thanks,
> Dan

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Friday, June 06, 2008 4:46 AM
To: Keir Fraser; Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Keir,

I think the changes are required. We'll run some tests today so
that we have some data to talk about.

-Dave

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Are these patches needed now the timers are built on Xen system
time rather than host TSC? Dan has reported much better
time-keeping with his patch checked in, and it's for sure a lot
less invasive than this patchset.

 -- Keir
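Collecting Dan's settings above into a guest config fragment (a sketch;
exact syntax depends on the toolstack version in use):

# HVM guest config for the comparison runs (illustrative):
hpet = 1            # expose the virtual hpet to the guest

# RHEL4-32 with hpet as clocksource: resilient to missed ticks.
timer_mode = 2

# RHEL4-32 with pit as clocksource: every tick must be delivered.
# timer_mode = 0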
I implemented the monotonicity guarantee within hvm_get_guest_time(). We
don't need or want get_s_time_mono().

 -- Keir

On 10/6/08 18:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> Keir, Dan:
>
> Although I plan to break up the patch, etc., I'm posting
> this fix to the patch for anyone who might be interested.
>
> thanks,
> Dave
>
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com
> #   vi-patch: xen-hpet
> #
> #   Bug Id: 6057
> #
> #   Reviewed by: Robert
> #
> #   SUMMARY: Fix wrap issue in monotonic s_time().
> #
> # xen/arch/x86/time.c
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com +3 -2
> #   Fix wrap issue in monotonic s_time().
> #
> diff -Nru a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> --- a/xen/arch/x86/time.c  2008-06-10 13:08:39 -04:00
> +++ b/xen/arch/x86/time.c  2008-06-10 13:08:39 -04:00
> @@ -534,7 +534,7 @@
>      u64 count;
>      unsigned long flags;
>      struct cpu_time *t = &this_cpu(cpu_time);
> -    u64 tsc, delta;
> +    u64 tsc, delta, diff;
>      s_time_t now;
>
>      if(hpet_main_counter_phys_avoid_hdw || !hpet_physical_inited) {
> @@ -542,7 +542,8 @@
>      rdtscll(tsc);
>      delta = tsc - t->local_tsc_stamp;
>      now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
> -    if(now > get_s_time_mon.last_ret)
> +    diff = (u64)now - (u64)get_s_time_mon.last_ret;
> +    if((s64)diff > (s64)0)
>          get_s_time_mon.last_ret = now;
>      else
>          now = get_s_time_mon.last_ret;
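The one-line change deserves a note: a direct now > last comparison
misorders timestamps once the arithmetic wraps, whereas casting the
unsigned difference to signed stays correct for any forward gap under
2^63. A standalone illustration (not the patch code itself):

#include <stdint.h>

/* Wrap-safe "is a later than b?" for free-running 64-bit timestamps.
 * Unsigned subtraction wraps modulo 2^64, and reading the result as
 * signed treats any forward gap below 2^63 as positive. */
static int time_is_after(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) > 0;
}

This is the same idiom as the Linux kernel's time_after() family of
macros.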
> I implemented the monotonicity guarantee within hvm_get_guest_time(). We
> don't need or want get_s_time_mono().

I'll give hvm_get_guest_time() another look.

-Dave
Dan Magenheimer wrote:

> > In EL5u1-32 however it looks like the fractions are accounted
> > for. Indeed the EL5u1-32 "lost tick handling" code resembles
> > the Linux/ia64 code which is what I've always assumed was
> > the "missed tick" model. In this case, I think no policy
> > is necessary and the measured skew should be identical to
> > any physical hpet skew. I'll have to test this hypothesis though.
>
> I've tested this hypothesis and it seems to hold true.
> This means the existing (unpatched) hpet code works fine
> on EL5-32bit (vcpus=1) when hpet is the clocksource,
> even when the machine is overcommitted. A second hypothesis
> still needs to be tested: that Dave's patch will not make this worse.

Interesting, thanks for pointing this out and confirming.

> (Note that per previous discussion, my EL5u1-32bit guest
> running on an Intel dual-core physical box chose tsc as
> the best clocksource and I had to override it with
> clock=hpet in the kernel command line.)

Is there one setting for all Linux guests that makes them choose hpet?
Is it "clock=hpet clocksource=hpet"? I know you wrote at length about
this before.

> > Yes, that makes sense and concurs with what I remember from
> > the EL4u5-32 code. If this is true, one would expect the
> > default "no missed tick" policy to see time moving faster
> > than an external source -- the first missed tick delivered
> > after a long sleep would "catch up" and then the remainder
> > would each add another tick.
>
> Indeed with the existing (unpatched) hpet code, time is
> running faster on EL4u5-32 (vcpus=1, when overcommitted).
> So Dave's patch is definitely needed here.

It's good to get the verification of this.

thanks,
Dave

> Will try 64-bit next.
>
> Dan
>>>> > When the pending_mask is clear it decrements >>>> hpet.intr_pending_nr and if >>>> > intr_pending_nr is still >>>> > non-zero posts another interrupt to the ioapic with >>>> hvm_isa_irq_assert_cb(). >>>> > Intr_pending_nr is incremented in >>>> hpet_route_decision_not_missed_ticks(). >>>> > >>>> > The missed ticks policy intr_en_notif callback also uses the >>>> pending_mask >>>> > method. So even though >>>> > Linux does not broadcast its interrupts, the code >>>> >>>> >>could handle >> >> >>>> it if it did. >>>> > In this case the end of interrupt time stamp is >>>> >>>> >>made when the >> >> >>>> pending_mask is >>>> > clear. >>>> > >>>> > 4. Live Migration >>>> > >>>> > Live migration with hpet preserves the current offset of the >>>> guest clock with >>>> > respect >>>> > to ntp. This is accomplished by migrating all of >>>> >>>> >>the state in >> >> >>>> the h->hpet data >>>> > structure >>>> > in the usual way. The hp->mc_offset is recalculated on the >>>> receiving node so >>>> > that the >>>> > guest sees a continuous hpet main counter. >>>> > >>>> > Code as been added to xc_domain_save.c to send a >>>> >>>> >>small message >> >> >>>> after the >>>> > domain context is sent. The contents of the message is the >>>> physical tsc >>>> > timestamp, last_tsc, >>>> > read just before the message is sent. When the >>>> >>>> >>>last_tsc message >>> >>> >>>> is received in >>>> > xc_domain_restore.c, >>>> > another physical tsc timestamp, cur_tsc, is read. The two >>>> timestamps are >>>> > loaded into the domain >>>> > structure as last_tsc_sender and first_tsc_receiver with >>>> hypercalls. Then >>>> > xc_domain_hvm_setcontext >>>> > is called so that hpet_load has access to these time stamps. >>>> Hpet_load uses >>>> > the timestamps >>>> > to account for the time spent saving and loading the domain >>>> context. With this >>>> > technique, >>>> > the only neglected time is the time spent sending a small >>>> network message. >>>> > >>>> > 5. Test Results >>>> > >>>> > Some recent test results are: >>>> > >>>> > 5.1 Linux 4u664 and Windows 2k864 load test. >>>> > Duration: 70 hours. >>>> > Test date: 6/2/08 >>>> > Loads: usex -b48 on Linux; burn-in on Windows >>>> > Guest vcpus: 8 for Linux; 2 for Windows >>>> > Hardware: 8 physical cpu AMD >>>> > Clock drift : Linux: .0012% Windows: .009% >>>> > >>>> > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 >>>> >>>> >>no-load test >> >> >>>> > Duration: 23 hours. >>>> > Test date: 6/3/08 >>>> > Loads: none >>>> > Guest vcpus: 8 for each Linux; 2 for Windows >>>> > Hardware: 4 physical cpu AMD >>>> > Clock drift : Linux: .033% Windows: .019% >>>> > >>>> > 6. Relation to recent work in xen-unstable >>>> > >>>> > There is a similarity between hvm_get_guest_time() in >>>> xen-unstable and >>>> > read_64_main_counter() >>>> > in this code. However, read_64_main_counter() is >>>> >>>> >>more tuned to >> >> >>>> the needs of >>>> > hpet.c. It has no >>>> > "set" operation, only the get. It isolates the mode, >>>> >>>> >>>physical or >>> >>> >>>> simulated, in >>>> > read_64_main_counter() >>>> > itself. It uses no vcpu or domain state as it is a physical >>>> entity, in either >>>> > mode. And it provides a real >>>> > physical mode for every read for those applications >>>> >>>> >>>that desire >>> >>> >>>> this. >>>> > >>>> > 7. Conclusion >>>> > >>>> > The virtual hpet is improved by this patch in terms >>>> >>>> >>>of accuracy and >>> >>> >>>> > monotonicity. 
>>>> > Tests performed to date verify this and more testing >>>> >>>> >>>is under way. >>> >>> >>>> > >>>> > 8. Future Work >>>> > >>>> > Testing with Windows Vista will be performed soon. >>>> >>>> >>The reason >> >> >>>> for accuracy >>>> > variations >>>> > on different platforms using the physical hpet >>>> >>>> >>device will be >> >> >>>> investigated. >>>> > Additional overhead measurements on simulated vs >>>> >>>> >>physical hpet >> >> >>>> mode will be >>>> > made. >>>> > >>>> > Footnotes: >>>> > >>>> > 1. I don''t recall the accuracy improvement with end >>>> >>>> >>>of interrupt >>> >>> >>>> stamping, but >>>> > it was >>>> > significant, perhaps better than two to one improvement. It >>>> would be a very >>>> > simple matter >>>> > to re-measure the improvement as the facility can >>>> >>>> >>call back at >> >> >>>> injection time >>>> > as well. >>>> > >>>> > >>>> > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com> >>>> > <mailto:dwinchell@virtualiron.com> >>>> > Signed-off-by: Ben Guthro <bguthro@virtualiron.com> >>>> > <mailto:bguthro@virtualiron.com> >>>> > >>>> > >>>> > _______________________________________________ >>>> > Xen-devel mailing list >>>> > Xen-devel@lists.xensource.com >>>> > http://lists.xensource.com/xen-devel >>>> >>>> >>>> >>>> >>>> >>> >>> > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
OK, I can confirm that without Dave's patch RHEL4- and RHEL5-based
64-bit uniprocessor kernels gain time when hpet is the clocksource.
But WHOA: with vcpus=2, el5u1-32 time suddenly goes crazy when domains
are added, whereas it seems fine when vcpus=1. All my testing so far
has been on 3.1.3, so I am going to redo it on xen-unstable, first
without Dave's patch, then with.

> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.

In the hvm configuration file:

    hpet=1
    acpi=1

(Note, acpi unspecified works too, as 1 appears to be the default;
but hpet=1 is ignored if acpi=0.)

In the kernel command line of the hvm domain (e.g. in grub.conf):

    clock=hpet notsc nopmtimer

(Note, a different set of kernel parameters is necessary for each
kernel, but because the kernel either ignores or gives harmless
warnings for invalid parameters, this set should always result in
hpet being selected as the clocksource, at least on all RHEL4- and
RHEL5-based kernels I've tested.)

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Wednesday, June 11, 2008 7:58 AM
> To: dan.magenheimer@oracle.com
> Cc: Keir Fraser; xen-devel; Ben Guthro; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan Magenheimer wrote:
>
> >> In EL5u1-32 however it looks like the fractions are accounted
> >> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> >> the Linux/ia64 code, which is what I've always assumed was
> >> the "missed tick" model. In this case, I think no policy
> >> is necessary and the measured skew should be identical to
> >> any physical hpet skew. I'll have to test this hypothesis though.
> >
> > I've tested this hypothesis and it seems to hold true.
> > This means the existing (unpatched) hpet code works fine
> > on EL5-32bit (vcpus=1) when hpet is the clocksource,
> > even when the machine is overcommitted. A second hypothesis
> > still needs to be tested: that Dave's patch will not make this worse.
>
> Interesting, thanks for pointing this out and confirming.
>
> > (Note that per previous discussion, my EL5u1-32bit guest
> > running on an Intel dual-core physical box chose tsc as
> > the best clocksource and I had to override it with
> > clock=hpet in the kernel command line.)
>
> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.
>
> >> Yes, that makes sense and concurs with what I remember from
> >> the EL4u5-32 code. If this is true, one would expect the
> >> default "no missed tick" policy to see time moving faster
> >> than an external source -- the first missed tick delivered
> >> after a long sleep would "catch up" and then the remainder
> >> would each add another tick.
> >
> > Indeed with the existing (unpatched) hpet code, time is
> > running faster on EL4u5-32 (vcpus=1, when overcommitted).
> > So Dave's patch is definitely needed here.
>
> It's good to get the verification of this.
>
> thanks,
> Dave
>
> > Will try 64-bit next.
> >
> > Dan
>
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Monday, June 09, 2008 9:21 PM
> To: 'Dave Winchell'; 'Keir Fraser'
> Cc: 'xen-devel'; 'Ben Guthro'
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> >> I'll tell you what I recall about this. Tomorrow I'll check the
> >> guest code to verify. I think that Linux declares a full tick,
> >> even if the interrupt is early. That's the problem.
>
> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code. If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.
>
> >> On the other hand, if the interrupt is late it in effect declares
> >> a tick plus fraction. If it just declared the fraction in the
> >> first place, we could deliver the interrupts whenever we wanted.
>
> My read of the EL4u5-32 code is that the fraction is discarded
> and a new tick period commences at "now", so the fractions
> eventually accumulate as lost time.
>
> In EL5u1-32 however it looks like the fractions are accounted
> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code, which is what I've always assumed was
> the "missed tick" model. In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew. I'll have to test this hypothesis though.
>
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Dave Winchell
> Sent: Monday, June 09, 2008 5:35 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Dave Winchell; xen-devel; Ben Guthro
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> >>> The Linux policy is more subtle, but is required to go
> >>> from .1% to .03%.
> >>
> >> Thanks for the good documentation which I hadn't thoroughly
> >> read until now. I now understand that the essence of your
> >> hpet missed ticks policy is to ensure that ticks are never
> >> delivered too close together. But I'm trying to understand
> >> WHY your patch works, in other words, what problem it is
> >> countering.
>
> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.
> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the
> first place, we could deliver the interrupts whenever we wanted.
>
> It's really not that different than the missed ticks policy in
> vpt.c, except that there the period in vpt.c is based on start of
> interrupt, and I have improved that with end-of-interrupt as
> described in the patch note.
>
> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.
>
> >> I care about this for more reasons than just
> >> because it is interesting: (1) I'd like to feel confident that
> >> it is fixing a bug rather than just a symptom of a bug;
> >> and (2) I wonder how universally it is applicable.
>
> It's worked well on my small set of guests. You and our
> QA are going to tell us about the wider set. It doesn't
> matter if guest A handles interrupts closely spaced or not,
> just whether it handles them far apart. So it should be pretty
> universal with guests that really handle missed ticks.
> I think it's interesting that some 32bit Linux guests handle
> missed ticks for hpet.
>
> >> I see from code examination in mark_offset_hpet() in
> >> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >> the correction for lost ticks is just plain wrong in
> >> a virtual environment. (Suppose for example that a virtual
> >> tick was delivered every 1.999*hpet_tick... I think
> >> the clock would be off by 50%!) Is this the bug that
> >> is being "countered" by your policy?
>
> I haven't looked at that code, perhaps.
> I'll check it tomorrow.
>
> >> However, the lost tick handling in RHEL5u1/kernel/timer.c
> >> (which I think is used also for hpet) is much better,
> >> so I am eager to find out if your policy works there too.
> >> If the hpet missed tick policy works for both, though,
> >> I should be happy, though I wonder about upstream kernels
> >> (e.g. the trend toward tickless).
>
> I wasn't aware of this trend. If it's robust, however, it should
> handle late interrupts ...
>
> >> That said, I'd rather
> >> see this get into Xen 3.3 and worry about upstream kernels
> >> later :-)
>
> Regards,
> Dave
>
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Sunday, June 08, 2008 2:32 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Hi Dan,
>
> >> While I am fully supportive of offering hardware hpet as an option
> >> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >> surprised by your preliminary results; the most obvious conclusion
> >> is that Xen system time is losing time at the rate of 1000 PPM,
> >> though it's possible there's a bug somewhere else in the "time
> >> stack". Your Windows result is jaw-dropping and inexplicable,
> >> though I have to admit ignorance of how Windows manages time.
>
> I think xen system time is fine. You have to add the interrupt
> delivery policies described in the write-up for the patch to get
> accurate timekeeping in the guest.
>
> The windows policy is obvious and results in a large improvement
> in accuracy. The Linux policy is more subtle, but is required to go
> from .1% to .03%.
>
> >> I think with my recent patch and hpet=1 (essentially the same as
> >> your emulated hpet), hvm guest time should track Xen system time.
> >> I wonder if domain0 (which if I understand correctly is directly
> >> using Xen system time) is also seeing an error of .1%? Also, for
> >> the skew you are seeing (in both hvm guests and domain0), is time
> >> moving too fast or too slow?
>
> I don't recall the direction. I can look it up in my notes at work
> tomorrow.
>
> >> Although hwhpet=1 is a fine alternative in many cases, it may
> >> be unavailable on some systems and may cause significant
> >> performance issues on others. So I think we will still need to
> >> track down the poor accuracy when hwhpet=0.
>
> Our patch is accurate to < .03% using the physical hpet mode or
> the simulated mode.
>
> >> And if for some reason
> >> Xen system time can't be made accurate enough (< 0.05%), then
> >> I think we should consider building Xen system time itself on
> >> top of hardware hpet instead of TSC... at least when Xen discovers
> >> a capable hpet.
>
> In our experience, Xen system time is accurate enough now.
>
> >> One more thought... do you know the accuracy of the TSC crystals
> >> on your test systems? I posted a patch awhile ago that was
> >> intended to test that, though I guess it was only testing skew
> >> of different TSCs on the same system, not TSCs against an
> >> external time source.
>
> I do not know the tsc accuracy.
>
> >> Or maybe there's a computation error somewhere in the hvm hpet
> >> scaling code? Hmmm...
>
> Regards,
> Dave
>
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors
> of .0012% on this platform under similar test conditions, and .03%
> on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits. In an overnight test with our hpet patch, the
> Windows vista error was .008%.
>
> The tests are with two or three guests on a physical node, all
> under load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config. Also, I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient
> > to missed ticks, so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next
> > week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
> > Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so
> > that we have some data to talk about.
> >
> > -Dave
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather than host TSC? Dan has reported much better
> > time-keeping with his patch checked in, and it's for sure a lot
> > less invasive than this patchset.
> >
> > -- Keir
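Dan's 1.999*hpet_tick example in the quoted exchange above is easy to
check numerically. Below is a toy sketch of his reading of the
RHEL4-32 lost-tick correction (the fraction of a period discarded at
every interrupt); it is not the actual mark_offset_hpet() code, and
the values are made up for illustration:

    #include <stdio.h>

    int main(void)
    {
        double tick = 1.0;   /* one hpet_tick, normalized */
        double real = 0.0, credited = 0.0;
        int i;

        for (i = 0; i < 1000; i++) {
            double delta = 1.999 * tick;  /* interrupt arrives just shy of 2 ticks */
            real += delta;
            /* floor(delta / tick): the fractional 0.999 is discarded */
            credited += (double)(long)(delta / tick) * tick;
        }

        printf("real=%.0f ticks, credited=%.0f ticks, error=%.0f%%\n",
               real, credited, (credited - real) / real * 100.0);
        return 0;
    }

Each interrupt credits one tick while 1.999 ticks of real time pass,
so the program reports an error of about -50%, matching Dan's "off by
50%" estimate.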
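For anyone setting this up by hand, the two pieces of configuration
Dan gives at the top of the message above live in different files. A
sketch follows; only hpet=1, acpi=1, and the clock parameters come
from the thread, while the file paths, kernel title/version, root
device, and initrd line are illustrative assumptions:

    # hvm guest configuration file (e.g. /etc/xen/el5u1) -- other options omitted
    hpet=1
    acpi=1    # 1 is the default; hpet=1 is ignored if acpi=0

    # grub.conf inside the guest: append the clock parameters to the kernel line
    title RHEL5
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 clock=hpet notsc nopmtimer
        initrd /initrd-2.6.18-8.el5.img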
(Going back on list.)

OK, so looking at the updated patch, hpet_avoid=1 is actually
working, just reporting wrong, correct?

With el5u1-64-hvm and hpet_avoid=1 and timer_mode=4, skew is under
-0.04% and falling. With hpet_avoid=0, it looks about the same.

However, both cases seem to start creeping up again when I put load
on, then fall again when I remove the load -- even with sched-credit
capping cpu usage. Odd! This implies to me that the activity in the
other domains IS affecting skew on the domain-under-test. (Keir, any
comments on the hypothesis attached below?)

Another theoretical oddity... if you are always delivering timer
ticks "late", fewer than the nominal 1000 ticks/sec should be being
received. So then why is guest time actually going faster than an
external source? (In my mind, going faster is much worse than going
slower, because if ntpd or a human moves time backwards to compensate
for a clock going faster, "make" and other programs can get very
confused.)

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Thursday, June 12, 2008 3:13 PM
> To: 'Dave Winchell'
> Subject: RE: xen hpet patch
>
> One more thought while waiting for compile and reboot:
>
> Am I right that all of the policies are correcting for when
> a domain "A" is out-of-context? There's nothing in any other
> domain "B" that can account for any timer loss/gain in domain
> "A". The only reason we are running other domains is to ensure
> that domain "A" is sometimes out-of-context, and the more
> it is out-of-context, the more likely we will observe
> a problem, correct?
>
> If this is true, it doesn't matter what workload is run
> in the non-A domains... as long as it is loading the
> CPU(s), thus ensuring that domain A is sometimes not
> scheduled on any CPU.
>
> And if all this is true, we may not need to run other
> domains at all... running "xm sched-credit -d A -c 50"
> should result in domain A being out-of-context at least
> half the time.
Dan,

You shouldn't be getting higher than .05%. I'd like to figure out
what is wrong. I'm running the same guest you are with heavy loads
and the physical processors overcommitted by 3:1. And I'm seeing
.027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then type 'Z' a couple of
times about 10 seconds apart and send me the output? Do this when
you have a domain running that is keeping poor time.

You should take drift measurements over a period of time that is at
least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from the xen
directory?

thanks,
Dave
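For readers following the numbers: the drift figures traded
throughout this thread (.027%, .05%, and so on) are the elapsed-time
error of the guest clock against a reference clock, expressed as a
percentage. A minimal sketch of the arithmetic, with sample values
chosen only to reproduce Dave's .027% over an hour (both elapsed
times are illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* elapsed seconds over the same ~1 hour interval */
        double ref_elapsed   = 3600.0;    /* reference (e.g. ntp) clock  */
        double guest_elapsed = 3600.972;  /* guest clock, illustrative   */

        double drift_pct = (guest_elapsed - ref_elapsed) / ref_elapsed * 100.0;
        printf("drift = %+.3f%%\n", drift_pct);  /* prints +0.027% */
        return 0;
    }

This is also why Dave asks for runs of at least 20 minutes: over
short intervals the absolute error is fractions of a second and
easily swamped by measurement noise.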
> > In EL5u1-32 however it looks like the fractions are accounted
> > for. Indeed the EL5u1-32 "lost tick handling" code resembles
> > the Linux/ia64 code, which is what I've always assumed was
> > the "missed tick" model. In this case, I think no policy
> > is necessary and the measured skew should be identical to
> > any physical hpet skew. I'll have to test this hypothesis though.
>
> I've tested this hypothesis and it seems to hold true.
> This means the existing (unpatched) hpet code works fine
> on EL5-32bit (vcpus=1) when hpet is the clocksource,
> even when the machine is overcommitted. A second hypothesis
> still needs to be tested: that Dave's patch will not make this worse.

OK, I can confirm that Dave's patch, as expected, does not make this
any worse.

The timer algorithm in 2.6.18 for x86 (i.e. RHEL5-32bit) is
definitely the most resilient to variations in tick delivery for a
monotonically-increasing timesource (i.e. hpet). This algorithm is in
arch-independent code, but sadly x86_64 didn't use it as of 2.6.18.

Dan
Hi Dan,

> Another theoretical oddity... if you are always delivering
> timer ticks "late", fewer than the nominal 1000 ticks/sec
> should be being received. So then why is guest time actually
> going faster than an external source?

With a guest that computes missed ticks, and does not deal with
fractional ticks when the interrupts are closer than a period: if you
send several interrupts farther apart than the period and then send
one closer than the period, the guest gains a tick. With this fact
you can have fewer than the expected number of interrupts and still
be gaining time.

With one that expects the right number of interrupts (Windows),
delivering fewer than expected makes the guest run slow.

-Dave
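To make the accounting concrete, here is a toy model of the effect
Dave describes: a guest that credits a late interrupt with its full
measured delta (missed ticks plus fraction) but still credits one
full tick for an early interrupt. This is a sketch under those
assumptions, not the actual RHEL or Xen code; the period value and
the delivery pattern are made up:

    #include <stdio.h>

    #define PERIOD 1000000ULL  /* hpet main-counter counts per tick (illustrative) */

    int main(void)
    {
        /* main-counter deltas between successive interrupts:
         * two delivered late (1.5 periods apart), one early (0.5) */
        unsigned long long deltas[] = { 3*PERIOD/2, 3*PERIOD/2, PERIOD/2 };
        unsigned long long real = 0, guest = 0;
        int i;

        for (i = 0; i < 3; i++) {
            real += deltas[i];
            if (deltas[i] >= PERIOD)
                guest += deltas[i]; /* late: missed ticks + fraction accounted */
            else
                guest += PERIOD;    /* early: a full tick is declared anyway */
        }

        /* real time advanced 3.5 periods, but only 3 interrupts were
         * delivered and the guest clock advanced 4.0 periods: fewer
         * than nominal ticks/sec, yet guest time runs fast. */
        printf("real=%llu guest=%llu gain=%llu\n", real, guest, guest - real);
        return 0;
    }

Dave's missed ticks policy removes the over-crediting branch from the
picture by never delivering two interrupts less than a period apart.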
Hi Dave --

I understand that ticks too close together cause time to move faster, but I thought your policy ensured that ticks were never delivered too close together. So I was surprised to see that time was moving faster rather than slower.

Dan
> I understand that ticks too close together cause
> time to move faster, but I thought your policy ensured
> that ticks were never delivered too close together.
> So I was surprised to see that time was moving faster
> rather than slower.

OK. Send me the debug info and I'll try to figure out what's going on.

thanks,
Dave
Hi Dave --

Hmmm... in my earlier runs with rhel5u1-64, I had apic=0 (yes apic, not acpi). Changing it to apic=1 gives excellent results (< 0.01% even with overcommit). Changing it back to apic=0 has the same fairly bad results, 0.08% with no overcommit and 0.16% (and climbing) with overcommit. Note that this is all with vcpus=1.

How odd...

I vaguely recalled from some research a couple of months ago that hpet is read MORE than once/tick on the boot processor. I can't seem to find the table I compiled from that research, but I did find this in an email I sent to you:

"You probably know this already but an n-way 2.6 Linux kernel reads hpet (n+1)*1000 times/second. Let's take five 2-way guests as an example; that comes to 15000 hpet reads/second...."

I wondered what was different between apic=1 vs 0. Using:

# cat /proc/interrupts | egrep 'LOC|timer'; sleep 10; \
  cat /proc/interrupts | egrep 'LOC|timer'

you can see that there are always 1000 LOC/sec. But with apic=1 there are also about 350 IO-APIC-edge-timer/sec, and with apic=0 there are 1000 XT-PIC-timer/sec.

I suspect that the latter of these (XT-PIC-timer) is messing up your policy and the former (edge-timer) is not.

Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Thursday, June 12, 2008 4:49 PM
To: dan.magenheimer@oracle.com; Keir Fraser; xen-devel
Cc: Ben Guthro; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan,

You shouldn't be getting higher than .05%. I'd like to figure out what is wrong. I'm running the same guest you are with heavy loads and the physical processors overcommitted by 3:1. And I'm seeing .027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then type 'Z' a couple of times about 10 seconds apart and send me the output? Do this when you have a domain running that is keeping poor time.

You should take drift measurements over a period of time that is at least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from the xen directory?

thanks,
Dave
On 13/6/08 05:47, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I wondered what was different between apic=1 vs 0. Using:
>
> # cat /proc/interrupts | egrep 'LOC|timer'; sleep 10; \
>   cat /proc/interrupts | egrep 'LOC|timer'
>
> you can see that there are always 1000 LOC/sec. But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

I think apic=0 is not a particularly useful configuration though, right?

-- Keir
Hi Dan,

I'm glad you're able to reproduce my results. Are you still seeing the boot time hang up? Is this the reason for vcpus=1?

> you can see that there are always 1000 LOC/sec. But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

Thanks for this data. Your analysis is correct, I think. I wrote the interrupt routing and callback code for the IOAPIC edge triggered interrupts. The PIC path does not have the callbacks. With no callbacks, it always looks to the routing code in hpet.c as though it has been longer than a period since the last interrupt, because the end-of-interrupt time stamp is zero. Thus you get an interrupt on each timeout, i.e. 1000 interrupts/sec. 350 is a typical amount when the algorithm for missed ticks is doing its thing. I'll put this on the bug list -- unless no one cares about apic=0.

thanks,
Dave
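A sketch of the gate Dave describes makes the failure mode visible (the identifiers are illustrative, not the patch's actual names). Delivery is allowed only if at least one period has elapsed since the guest last finished handling a tick; with the PIC path the end-of-interrupt callback never fires, so the stamp stays zero and the gate always opens:

#include <stdint.h>
#include <stdbool.h>

struct vhpet_timer {
    uint64_t period;          /* main-counter counts per tick */
    uint64_t last_eoi_count;  /* stamped by the intr_en_notif callback */
};

static bool deliver_tick(const struct vhpet_timer *t, uint64_t counter_now)
{
    /* With no callbacks, last_eoi_count stays 0 and this is always
     * true, giving the full 1000 interrupts/sec Dan observed. */
    return counter_now - t->last_eoi_count >= t->period;
}

int main(void)
{
    struct vhpet_timer t = { .period = 14318, .last_eoi_count = 0 };
    return deliver_tick(&t, 100) ? 0 : 1;   /* delivers even at count 100 */
}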
Dan, Keir,

In an overnight (17.5 hrs) test with three guests, 8 vcpus each on 8 physical cpus, all under usex b48 loads, I noted the following errors:

    rh4u664       -.72 sec  (.0012%)
    rhas5u164    -10.2 sec  (.016%)
    sles10u164    -9.3 sec  (.015%)

The number for rh4u664 is what I am used to seeing on this platform. The other ones are 10 times worse, but still good enough for ntp. The reason they are worse is that the guest clock code for hpet in rhas5u164 looks at the cmp register to calculate interrupt delay. I mentioned before on this list that one of the beauties of hpet was the fine hpet code in the guest (rh4u664), which did not use the delay computation, which in my mind is unnecessary and adds error. Well, in rhas5u164, and I assume in sles10u164, delay is back in, and so is the associated error. The cmp register is also the reason for the hesitations on boot. I'll have more to say on this later.

thanks,
Dave
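For readers unfamiliar with the delay computation Dave is criticizing, here is a heavily hedged sketch of the idea (illustrative names only; not the rhas5u164 source). The guest estimates how late the interrupt is by comparing the main counter against the comparator (cmp) register and folds that delay into its accounting; if the virtual cmp does not track the injected interrupts exactly, the estimate is wrong and error accumulates, whereas counting whole periods from the main counter alone (as rh4u664 does) avoids this:

#include <stdio.h>
#include <stdint.h>

static uint64_t hpet_irq_delay(uint64_t main_counter, uint64_t cmp,
                               uint64_t period)
{
    uint64_t late = main_counter - cmp;  /* counts since cmp "fired" */
    return late % period;                /* fractional-period delay estimate */
}

int main(void)
{
    /* If the virtual cmp lags the injection point, the "delay" is
     * overestimated and the guest over-corrects its clock. */
    printf("delay = %llu counts\n",
           (unsigned long long)hpet_irq_delay(30000, 14318, 14318));
    return 0;
}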
> I think apic=0 is not a particularly useful configuration
> though, right?

We've seen it proposed sometimes as a workaround for a boot-time problem, but I agree it's not useful enough to warrant concern or stand in the way of Dave's patch.

Dan
Kudos, Dave, for your excellent work!

Keir, I've completed enough testing to agree that Dave's hpet policy is a huge improvement over the existing hpet code and a major improvement over the pit-based policies/timekeeping. I strongly recommend that, once Dave's soon-to-be-revised patch is in, we turn on hpet by default for all hvm guests. I'd also suggest that the default timer_mode (at least when hpet=1) should be Dave's guest_computes_missed_ticks policy. (Dave, could you include this in your revised patch? Or if you want me to, let me know.)

A couple of remaining points...

> I'm glad you're able to reproduce my results.
> Are you still seeing the boot time hang up?
> Is this the reason for vcpus=1?

No, I was just trying to be methodical in my testing, covering various cases. I haven't seen the boot-time hang for a while.

> I'll put this on the bug list - unless no one
> cares about apic=0.

It probably should be "on the bug list" but very low priority compared with getting the patch cleaned up (per Keir's requirements) in time for the 3.3 release.

Dan
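For reference, selecting the recommended policy from the toolstack would look roughly like the sketch below. It assumes the libxc xc_set_hvm_param() call of this era; HVM_PARAM_TIMER_MODE is the existing parameter, and HVM_HPET_guest_computes_missed_ticks is one of the two values the patch adds, so this compiles only against the patched headers:

#include <xenctrl.h>

static int set_hpet_missed_ticks_policy(int xc_handle, uint32_t domid)
{
    /* Choose the hpet policy for guests (e.g. Linux) that compute
     * missed ticks themselves from the main counter. */
    return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE,
                            HVM_HPET_guest_computes_missed_ticks);
}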
Dan Magenheimer wrote:

> Kudos, Dave, for your excellent work!

Thanks, Dan.

> Keir, I've completed enough testing to agree that
> Dave's hpet policy is a huge improvement over the
> existing hpet code and a major improvement over
> the pit-based policies/timekeeping. I strongly
> recommend that, once Dave's soon-to-be-revised
> patch is in, we turn on hpet by default for all
> hvm guests. I'd also suggest that the default
> timer_mode (at least when hpet=1) should be
> Dave's guest_computes_missed_ticks policy.
> (Dave, could you include this in your revised
> patch? Or if you want me to, let me know.)

Sure, I can do it.

> A couple of remaining points...
>
>> I'm glad you're able to reproduce my results.
>> Are you still seeing the boot time hang up?
>> Is this the reason for vcpus=1?
>
> No, I was just trying to be methodical in my testing,
> covering various cases. I haven't seen the boot-time
> hang for a while.

OK. We still see it here, so I'm working on a fix/workaround.

>> I'll put this on the bug list - unless no one
>> cares about apic=0.
>
> It probably should be "on the bug list" but very low
> priority compared with getting the patch cleaned up
> (per Keir's requirements) in time for the 3.3 release.

OK. Dan, thanks very much for the testing work. I know it's not easy and you still came up with the results very quickly.

-Dave