1. Introduction

This patch improves the hpet-based guest clock in terms of drift and monotonicity. Prior to this work the drift with hpet was greater than 2%, far above the .05% limit for ntp to synchronize. With this code, the drift ranges from .001% to .0033%, depending on guest and physical platform.

Using hpet allows guest operating systems to provide monotonic time to their applications. Time sources other than hpet are not monotonic because of their reliance on the tsc, which is not synchronized across physical processors.

Windows 2k864 and many Linux guests are supported with two policies: one for guests that handle missed clock interrupts and the other for guests that require the correct number of interrupts.

Guests may use hpet as the timing source even if the physical platform has no visible hpet. Migration is supported between physical machines which differ in physical hpet visibility.

Most of the changes are in hpet.c. Two general facilities are added to track interrupt progress. The ideas and facilities here would also be useful in vpt.c, for other time sources, though no attempt is made here to improve vpt.c.

The following sections discuss hpet dependencies, interrupt delivery policies, live migration, test results, and the relation to recent work with monotonic time.

2. Virtual Hpet dependencies

The virtual hpet depends on the ability to read the physical or simulated (see discussion below) hpet. For timekeeping, the virtual hpet also depends on two new interrupt notification facilities to implement its policies for interrupt delivery.

2.1. Two modes of low-level hpet main counter reads.

In this implementation, the virtual hpet reads, with read_64_main_counter() exported by time.c, either the real physical hpet main counter register directly or a "simulated" hpet main counter.

The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last time value is returned whenever the current time value is less than the last time value. In simulated mode, since it is layered on s_time, the underlying hardware can be hpet or some other device. The frequency of the main counter in simulated mode is the same as the standard physical hpet frequency, allowing live migration between nodes that are configured differently.

If the physical platform does not have an hpet device, or if xen is configured not to use the device, then the simulated method is used. If there is a physical hpet device and xen has initialized it, then either simulated or physical mode can be used. This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the simulated mode and 0 the physical mode. The default is physical mode.

A disadvantage of the physical mode is that it may take longer to read the device than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for physical and simulated modes, while on others the physical cost is much higher than the simulated one. A disadvantage of the simulated mode is that it can return the same value for the counter in consecutive calls.
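For illustration, the simulated-mode read described above might look like the following minimal sketch (the physical branch, a direct register read, is omitted). SIM_HPET_HZ, the last-value variable, and the locking are assumptions of this sketch, not taken from the patch; only read_64_main_counter(), NOW(), and the monotonic-clamp behavior come from the text.

    /* Sketch of the simulated main-counter read -- not the patch code. */
    #define SIM_HPET_HZ 14318180UL   /* assumed "standard" hpet frequency */

    static uint64_t simulated_mc_last;
    static DEFINE_SPINLOCK(simulated_mc_lock);

    uint64_t read_64_main_counter(void)
    {
        uint64_t ticks;
        unsigned long flags;

        /* Scale Xen system time (ns since boot) to main-counter ticks. */
        ticks = muldiv64(NOW(), SIM_HPET_HZ, 1000000000UL);

        spin_lock_irqsave(&simulated_mc_lock, flags);
        /* Monotonic clamp: never return less than the last value returned. */
        if ( ticks < simulated_mc_last )
            ticks = simulated_mc_last;
        else
            simulated_mc_last = ticks;
        spin_unlock_irqrestore(&simulated_mc_lock, flags);

        return ticks;
    }

Because the clamp is global rather than per-vcpu, two back-to-back reads can legitimately return the same value, which is the simulated-mode disadvantage noted above.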
2.2. Interrupt notification facilities.

Two interrupt notification facilities are introduced: one is hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().

The vhpet uses hvm_isa_irq_assert_cb() to deliver interrupts to the vioapic. hvm_isa_irq_assert_cb() allows a callback to be passed along to vioapic_deliver(), and this callback is called with a mask of the vcpus which will get the interrupt. This callback is made before any vcpus receive the interrupt.

The vhpet uses hvm_register_intr_en_notif() to register a handler for a particular vector; the handler is called when that vector is injected in [vmx,svm]_intr_assist() and also when the guest finishes handling the interrupt. Here "finished" is defined as the point when the guest re-enables interrupts or lowers the tpr value. EOI is not used as the end of interrupt because it is sometimes issued before the interrupt handler has done its work. A flag is passed to the handler indicating whether this is the injection point (post = 1) or the interrupt-finished point (post = 0). The need for the finished-point callback is discussed in the missed ticks policy section.

To prevent a possible early trigger of the finished callback, the intr_en_notif logic has a two-stage arm: the first stage at injection (hvm_intr_en_notif_arm()) and the second when interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
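The prototypes are not spelled out in this note; the sketch below shows one plausible shape for the two facilities, consistent with the description above. The argument lists and types are assumptions; only the function names and the semantics of the post flag come from the patch description.

    /* Sketch only: argument lists are assumptions, not the patch's API. */

    /* Callback invoked by vioapic_deliver() with the mask of vcpus that
     * will receive the interrupt, before any vcpu actually sees it. */
    typedef void (*isa_irq_assert_cb_t)(struct domain *d, uint32_t vcpu_mask);

    /* Assert an ISA IRQ, passing the callback along to vioapic_deliver(). */
    void hvm_isa_irq_assert_cb(struct domain *d, unsigned int isa_irq,
                               isa_irq_assert_cb_t cb);

    /* Handler called twice per delivered interrupt: post = 1 at injection
     * in [vmx,svm]_intr_assist(), post = 0 once the guest has finished
     * handling it (re-enabled interrupts or lowered the tpr). */
    typedef void (*intr_en_notif_t)(struct vcpu *v, int post);

    /* Register the handler for one vector. */
    void hvm_register_intr_en_notif(struct domain *d, uint8_t vector,
                                    intr_en_notif_t handler);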
3. Interrupt delivery policies

The existing hpet interrupt delivery is preserved. This includes the vcpu round-robin delivery used by Linux and the broadcast delivery used by Windows.

There are two policies for interrupt delivery: one for Windows 2k864 and the other for Linux. The Linux policy takes advantage of the (guest) Linux missed-tick and offset calculations and does not attempt to deliver the right number of interrupts. The Windows policy delivers the correct number of interrupts, even if some are much closer to each other than the period. The policies are similar to those in vpt.c, though there are some important differences.

Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE. Two new values are added: HVM_HPET_guest_computes_missed_ticks and HVM_HPET_guest_does_not_compute_missed_ticks. The reason two new values are added is that some guests (32-bit Linux) need a no-missed policy for clock sources other than hpet and a missed ticks policy for hpet. It was felt that there would be less confusion by simply introducing the two hpet policies.

3.1. The missed ticks policy

The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet main counter. The algorithm works well when the time since the last interrupt is greater than or equal to a period, and poorly otherwise.

The missed ticks policy ensures that no two clock interrupts are delivered to the guest at a time interval less than a period. A time stamp (hpet main counter value) is recorded (by a callback registered with hvm_register_intr_en_notif()) when Linux finishes handling the clock interrupt. Ensuing interrupts are then delivered to the vioapic only if the current main counter value is at least a period greater than it was when the last interrupt was handled.

Tests showed a significant improvement in clock drift with end-of-interrupt time stamps versus beginning-of-interrupt time stamps [1]. It is believed that the reason for the improvement is that the clock interrupt handler contends for a spinlock and can therefore be delayed in its processing. Furthermore, the main counter is read by the guest under the lock. The net effect is that if we time stamp at injection, the difference in time between successive interrupt handler lock acquisitions can end up less than the period.

3.2. The no-missed ticks policy

Windows 2k864 keeps very poor time with the missed ticks policy, so the no-missed ticks policy was developed. Under the no-missed ticks policy we deliver the correct number of interrupts, even if they are spaced less than a period apart (when catching up).

Windows 2k864 uses a broadcast mode in the interrupt routing such that all vcpus get the clock interrupt. The best Windows drift performance was achieved when the policy code ensured that all the previous interrupts (on the various vcpus) had been injected before injecting the next interrupt to the vioapic.

The policy code works as follows; a code sketch follows at the end of this section. It uses hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered with hvm_register_intr_en_notif(), at post = 1 time it clears the current vcpu in the pending_mask. When the pending_mask becomes clear it decrements hpet.intr_pending_nr and, if intr_pending_nr is still non-zero, posts another interrupt to the ioapic with hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().

The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though Linux does not broadcast its interrupts, the code could handle it if it did. In this case the end-of-interrupt time stamp is taken when the pending_mask is clear.
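To make the two policies concrete, here is a sketch of the decision and callback logic just described. The names pending_mask, intr_pending_nr, and hpet_route_decision_not_missed_ticks() come from this note; the struct layout, the policy flag, domain_vhpet(), and all locking are assumptions, so treat this as illustrative C rather than the patch itself.

    /* Pre-delivery callback: record which vcpus will get this interrupt. */
    static void hpet_assert_cb(struct domain *d, uint32_t vcpu_mask)
    {
        domain_vhpet(d)->pending_mask = vcpu_mask;
    }

    /* 3.1: deliver only if a full period has passed since the guest last
     * *finished* handling a clock interrupt. */
    static void hpet_route_decision_missed_ticks(struct domain *d,
                                                 struct HPETState *h)
    {
        if ( read_64_main_counter() - h->mc_last_handled >= h->period )
            hvm_isa_irq_assert_cb(d, h->isa_irq, hpet_assert_cb);
    }

    /* 3.2: account for every tick; inject immediately only when no earlier
     * broadcast is still in flight. */
    static void hpet_route_decision_not_missed_ticks(struct domain *d,
                                                     struct HPETState *h)
    {
        if ( h->intr_pending_nr++ == 0 )
            hvm_isa_irq_assert_cb(d, h->isa_irq, hpet_assert_cb);
    }

    /* Registered via hvm_register_intr_en_notif(): post = 1 at injection,
     * post = 0 when the guest has finished handling the interrupt. */
    static void hpet_intr_en_notif(struct vcpu *v, int post)
    {
        struct HPETState *h = domain_vhpet(v->domain);

        if ( post == 1 )
            h->pending_mask &= ~(1u << v->vcpu_id);

        if ( h->pending_mask != 0 )
            return;                 /* some vcpu has not been injected yet */

        if ( h->missed_ticks_policy )
        {
            if ( post == 0 )        /* end-of-interrupt time stamp (3.1) */
                h->mc_last_handled = read_64_main_counter();
        }
        else if ( post == 1 && --h->intr_pending_nr != 0 )
            /* Catch up: the whole broadcast went in; post the next tick. */
            hvm_isa_irq_assert_cb(v->domain, h->isa_irq, hpet_assert_cb);
    }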
4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect to ntp. This is accomplished by migrating all of the state in the h->hpet data structure in the usual way. The hp->mc_offset is recalculated on the receiving node so that the guest sees a continuous hpet main counter.

Code has been added to xc_domain_save.c to send a small message after the domain context is sent. The content of the message is a physical tsc timestamp, last_tsc, read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has access to these time stamps. hpet_load uses the timestamps to account for the time spent saving and loading the domain context. With this technique, the only neglected time is the time spent sending a small network message.
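As a rough picture of the counter-continuity fix-up, the restore side might do something like the sketch below. mc_offset, last_tsc_sender, and first_tsc_receiver are named in the text; how the elapsed migration time is converted from those tsc stamps into main-counter ticks is glossed over here (the migration_ticks parameter), and the guest-counter relation and saved-state plumbing are assumptions.

    /* Guest-visible main counter: host/simulated counter plus a per-domain
     * offset, as implied by the mc_offset recalculation described above. */
    static inline uint64_t hpet_read_guest_mc(const struct HPETState *hp)
    {
        return read_64_main_counter() + hp->mc_offset;
    }

    /* Restore side: resume the guest counter from its saved value,
     * advanced by the migration dead time derived (elsewhere) from
     * last_tsc_sender and first_tsc_receiver. */
    static void hpet_load_recalc_offset(struct HPETState *hp,
                                        uint64_t saved_guest_mc,
                                        uint64_t migration_ticks)
    {
        hp->mc_offset = saved_guest_mc + migration_ticks
                        - read_64_main_counter();
    }

With this form, the first post-migration read returns saved_guest_mc + migration_ticks, so the guest counter neither jumps backward nor loses the time the migration itself consumed.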
5. Test Results

Some recent test results are:

5.1 Linux 4u664 and Windows 2k864 load test
Duration: 70 hours. Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux: .0012%  Windows: .009%

5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
Duration: 23 hours. Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux: .033%  Windows: .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter() in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter() itself. It uses no vcpu or domain state, as it is a physical entity in either mode. And it provides a real physical mode for every read, for those applications that desire this.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity. Tests performed to date verify this, and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations on different platforms using the physical hpet device will be investigated. Additional overhead measurements of simulated vs. physical hpet mode will be made.

Footnotes:

1. I don't recall the exact accuracy improvement with end-of-interrupt stamping, but it was significant, perhaps better than a two-to-one improvement. It would be a very simple matter to re-measure the improvement, as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>


Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it's for sure a lot less invasive than this patchset.

-- Keir
Keir,

I think the changes are required. We'll run some tests today so that we have some data to talk about.

-Dave
This seems to break the save/restore format (in at least two places)...

S.
Hi Dave and Ben --

When running tests on xen-unstable (without your patch), please ensure that hpet=1 is set in the hvm config. Also, I think that when hpet is the clocksource on RHEL4-32, the clock IS resilient to missed ticks, so timer_mode should be 2 (vs. when pit is the clocksource on RHEL4-32, all clock ticks must be delivered, and so timer_mode should be 0).

Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my intent to clean this up, but I won't get to it until next week.

Thanks,
Dan
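For concreteness, the two settings Dan refers to would appear in the hvm guest configuration file roughly as follows (an illustrative fragment; the rest of the config is omitted):

    # HVM guest config fragment (illustrative)
    hpet = 1          # expose the virtual hpet to the guest
    timer_mode = 2    # hpet clocksource on RHEL4-32: guest computes missed ticks
    # timer_mode = 0  # use instead when pit is the clocksource: deliver every tick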
Steven Hand wrote:
> This seems to break the save/restore format (in at least two places)...

Steven,

Can you give me more information on this? What sort of failures are you seeing?

thanks,
Dave
Hi Dan,

I am running with hpet=1 and timer_mode=2. I don't see where timer_mode is checked for hpet timekeeping, but I set it nevertheless.

thanks,
Dave
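For concreteness, both settings Dan mentions live in the hvm guest config file. A minimal fragment might look like the sketch below; hpet and timer_mode are the options discussed in this thread, while the surrounding lines are illustrative boilerplate from a typical hvm config, not taken from this thread:

    # hvm guest config fragment (illustrative sketch)
    kernel = "/usr/lib/xen/boot/hvmloader"
    builder = "hvm"
    # expose an hpet to the guest
    hpet = 1
    # 2 = guest computes missed ticks (e.g. RHEL4-32 with hpet clocksource)
    # 0 = all ticks must be delivered (e.g. RHEL4-32 with pit clocksource)
    timer_mode = 2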
Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64-bit guests configured for hpet with xen-unstable as is. As we have discussed many times, the ntp requirement is .05%. Tests on the patch we just submitted for hpet have indicated errors of .0012% on this platform under similar test conditions, and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits. In an overnight test with our hpet patch, the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave

Dan Magenheimer wrote:
> [snip: message quoted in full earlier in the thread]
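As background on how these percentages are formed: drift is the fractional disagreement between the guest clock and an ntp-disciplined reference over the same interval. A minimal C sketch of the arithmetic, with made-up sample values rather than the actual harness behind the numbers above:

    #include <math.h>
    #include <stdio.h>

    /* Sketch of the drift arithmetic behind figures like .1%, .05% and
     * .0012% in this thread.  The two elapsed times would come from the
     * guest clock and an ntp-disciplined reference sampled over the same
     * test interval; the values below are invented for illustration. */
    int main(void)
    {
        double guest_elapsed_s = 252072.0; /* ~70 hours by the guest clock */
        double ref_elapsed_s   = 252000.0; /* same interval by the reference */

        double drift_pct = fabs(guest_elapsed_s - ref_elapsed_s)
                           / ref_elapsed_s * 100.0;

        /* ntp can only discipline a clock whose drift is under .05% */
        printf("drift = %.4f%% (%s the .05%% ntp limit)\n",
               drift_pct, drift_pct < 0.05 ? "within" : "outside");
        return 0;
    }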
Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option for hvm guests (let's call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM, though it's possible there's a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which, if I understand correctly, is directly using Xen system time) is also seeing an error of .1%? Also, for the skew you are seeing (in both hvm guests and domain0), is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may be unavailable on some systems and may cause significant performance issues on others. So I think we will still need to track down the poor accuracy when hwhpet=0. And if for some reason Xen system time can't be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source.

Or maybe there's a computation error somewhere in the hvm hpet scaling code? Hmmm...

Thanks,
Dan

> [snip: Dave's message of Friday, June 06, 2008 1:33 PM, quoted in full earlier in the thread]
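The unit conversion between Dave's percentage and Dan's PPM figure is direct:

    .1% drift = .001 s per s = 1000 us per s = 1000 parts per million (PPM)
    .05% (the ntp limit) = 500 PPM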
On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0. And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time() call the platform time read function and bypass TSC altogether. This would be cleaner than having only the vHPET code punch through to the physical HPET: instead we have the boot-time chosen platform timesource used by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's patch. If this is actually the case, it would be nice to break those out as separate patches, as I think an 11% drift must largely be due to device-model bugs rather than to relatively insignificant differences between hvm_get_guest_time() and the physical HPET main counter.

-- Keir
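A rough C sketch of the extra timer_mode being proposed; read_platform_stime() and the offset field are assumed names used for illustration, not necessarily the actual Xen interfaces of the day:

    #include <stdint.h>

    typedef uint64_t u64;

    /* Assumed interface: nanoseconds since boot from the boot-time chosen
     * platform timesource (hpet, pit, acpi-pm timer, ...). */
    extern u64 read_platform_stime(void);

    /* Assumed per-domain state: an offset maintained so that save/restore
     * and migration present a continuous guest clock. */
    struct hvm_time_sketch {
        u64 stime_offset;
    };

    /* The proposed mode: guest time comes straight from the platform
     * timesource, bypassing the host TSC entirely, so every virtual timer
     * sees the same boot-time chosen clock. */
    u64 hvm_get_guest_time_platform(const struct hvm_time_sketch *t)
    {
        return read_platform_stime() + t->stime_offset;
    }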
> Possibly there are bugs in the hpet device model which are fixed by Dave's
> patch. If this is actually the case, it would be nice to break those out as
> separate patches, as I think an 11% drift must largely be due to
> device-model bugs rather than relatively insignificant differences between
> hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short-circuited the missed ticks policy code in the hpet.c patch, but used the physical hpet for each access. The result for Linux was a drift of .1%, the same as the xen-unstable bits. Conversely, I get very good drift numbers, i.e., under .03%, when using the missed ticks policy code and running in simulated mode (layered on stime) when stime uses hpet. So clearly, the improvement from .1% to .03% is due to the policy code.

I haven't run the short-circuit test with the Windows policy, but I can do that on Monday.

Note: for Windows and Linux I get < .03% drift using the policy code and running in simulated mode, whether stime is using hpet or some other device.

regards,
Dave

> [snip: Keir's message of Fri 6/6/2008 6:34 PM, quoted in full above]
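The policy code in question gates delivery on the hpet main counter. A minimal C sketch of the decision that was short-circuited in this experiment, with illustrative names rather than the patch's actual code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of the missed ticks policy's delivery gate: inject the next
     * clock interrupt only if at least one period of the hpet main counter
     * has elapsed since the guest finished handling the previous one.
     * last_handled is the stamp recorded by the end-of-interrupt (post = 0)
     * notification callback, not the injection-time stamp. */
    bool tick_due(uint64_t main_counter_now,
                  uint64_t last_handled,
                  uint64_t period)
    {
        return (main_counter_now - last_handled) >= period;
    }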
Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM,
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think Xen system time is fine. You have to add the interrupt delivery policies described in the write-up for the patch to get accurate timekeeping in the guest. The Windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which, if I understand correctly, is directly
> using Xen system time) is also seeing an error of .1%? Also, for
> the skew you are seeing (in both hvm guests and domain0), is time
> moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave

> [snip: Dan's message of Fri 6/6/2008 4:29 PM, quoted in full earlier in the thread]
So I think we will still need to track down the poor accuracy when hwhpet=0. And if for some reason Xen system time can''t be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet. One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source. Or maybe there''s a computation error somewhere in the hvm hpet scaling code? Hmmm... Thanks, Dan> -----Original Message----- > From: Dave Winchell [mailto:dwinchell@virtualiron.com] > Sent: Friday, June 06, 2008 1:33 PM > To: dan.magenheimer@oracle.com; Keir Fraser > Cc: Ben Guthro; xen-devel; Dave Winchell > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > Dan, Keir: > > Preliminary tests results indicate an error of .1% for Linux 64 bit > guests configured > for hpet with xen-unstable as is. As we have discussed many times, the > ntp requirement is .05%. > Tests on the patch we just submitted for hpet have indicated errors of > .0012% > on this platform under similar test conditions and .03% on > other platforms. > > Windows vista64 has an error of 11% using hpet with the > xen-unstable bits. > In an overnight test with our hpet patch, the Windows vista > error was .008%. > > The tests are with two or three guests on a physical node, all under > load, and with > the ratio of vcpus to phys cpus > 1. > > I will continue to run tests over the next few days. > > thanks, > Dave > > > Dan Magenheimer wrote: > > > Hi Dave and Ben -- > > > > When running tests on xen-unstable (without your patch), > please ensure > > that hpet=1 is set in the hvm config and also I think that when hpet > > is the clocksource on RHEL4-32, the clock IS resilient to > missed ticks > > so timer_mode should be 2 (vs when pit is the clocksource > on RHEL4-32, > > all clock ticks must be delivered and so timer_mode should be 0). > > > > Per > > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg > 00098.html it''s > > my intent to clean this up, but I won''t get to it until next week. > > > > Thanks, > > Dan > > > > -----Original Message----- > > *From:* xen-devel-bounces@lists.xensource.com > > [mailto:xen-devel-bounces@lists.xensource.com]*On > Behalf Of *Dave > > Winchell > > *Sent:* Friday, June 06, 2008 4:46 AM > > *To:* Keir Fraser; Ben Guthro; xen-devel > > *Cc:* dan.magenheimer@oracle.com; Dave Winchell > > *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > > Keir, > > > > I think the changes are required. We''ll run some tests > today today so > > that we have some data to talk about. > > > > -Dave > > > > > > -----Original Message----- > > From: xen-devel-bounces@lists.xensource.com on behalf > of Keir Fraser > > Sent: Fri 6/6/2008 4:58 AM > > To: Ben Guthro; xen-devel > > Cc: dan.magenheimer@oracle.com > > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy > > > > Are these patches needed now the timers are built on Xen system > > time rather > > than host TSC? Dan has reported much better > time-keeping with his > > patch > > checked in, and it¹s for sure a lot less invasive than > this patchset. > > > > > > -- Keir > > > > On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote: > > > > > > > > 1. 
Introduction > > > > > > This patch improves the hpet based guest clock in > terms of drift and > > > monotonicity. > > > Prior to this work the drift with hpet was greater > than 2%, far > > above the .05% > > > limit > > > for ntp to synchronize. With this code, the drift ranges from > > .001% to .0033% > > > depending > > > on guest and physical platform. > > > > > > Using hpet allows guest operating systems to provide monotonic > > time to their > > > applications. Time sources other than hpet are not > monotonic because > > > of their reliance on tsc, which is not synchronized > across physical > > > processors. > > > > > > Windows 2k864 and many Linux guests are supported with two > > policies, one for > > > guests > > > that handle missed clock interrupts and the other for guests > > that require the > > > correct number of interrupts. > > > > > > Guests may use hpet for the timing source even if the physical > > platform has no > > > visible > > > hpet. Migration is supported between physical machines which > > differ in > > > physical > > > hpet visibility. > > > > > > Most of the changes are in hpet.c. Two general facilities are > > added to track > > > interrupt > > > progress. The ideas here and the facilities would be useful in > > vpt.c, for > > > other time > > > sources, though no attempt is made here to improve vpt.c. > > > > > > The following sections discuss hpet dependencies, interrupt > > delivery policies, > > > live migration, > > > test results, and relation to recent work with monotonic time. > > > > > > > > > 2. Virtual Hpet dependencies > > > > > > The virtual hpet depends on the ability to read the > physical or > > simulated > > > (see discussion below) hpet. For timekeeping, the > virtual hpet > > also depends > > > on two new interrupt notification facilities to implement its > > policies for > > > interrupt delivery. > > > > > > 2.1. Two modes of low-level hpet main counter reads. > > > > > > In this implementation, the virtual hpet reads with > > read_64_main_counter(), > > > exported by > > > time.c, either the real physical hpet main counter register > > directly or a > > > "simulated" > > > hpet main counter. > > > > > > The simulated mode uses a monotonic version of get_s_time() > > (NOW()), where the > > > last > > > time value is returned whenever the current time value is less > > than the last > > > time > > > value. In simulated mode, since it is layered on s_time, the > > underlying > > > hardware > > > can be hpet or some other device. The frequency of the main > > counter in > > > simulated > > > mode is the same as the standard physical hpet frequency, > > allowing live > > > migration > > > between nodes that are configured differently. > > > > > > If the physical platform does not have an hpet > device, or if xen > > is configured > > > not > > > to use the device, then the simulated method is used. If there > > is a physical > > > hpet device, > > > and xen has initialized it, then either simulated or physical > > mode can be > > > used. > > > This is governed by a boot time option, hpet-avoid. > Setting this > > option to 1 > > > gives the > > > simulated mode and 0 the physical mode. The default > is physical > > mode. > > > > > > A disadvantage of the physical mode is that may take longer to > > read the device > > > than in simulated mode. On some platforms the cost is > about the > > same (less > > > than 250 nsec) for > > > physical and simulated modes, while on others physical cost is > > much higher > > > than simulated. 
> > > A disadvantage of the simulated mode is that it can return the > > same value > > > for the counter in consecutive calls. > > > > > > 2.2. Interrupt notification facilities. > > > > > > Two interrupt notification facilities are introduced, one is > > > hvm_isa_irq_assert_cb() > > > and the other hvm_register_intr_en_notif(). > > > > > > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to > > the vioapic. > > > hvm_isa_irq_assert_cb allows a callback to be passed along to > > > vioapic_deliver() > > > and this callback is called with a mask of the vcpus > which will > > get the > > > interrupt. This callback is made before any vcpus receive an > > interrupt. > > > > > > Vhpet uses hvm_register_intr_en_notif() to register a handler > > for a particular > > > vector that will be called when that vector is injected in > > > [vmx,svm]_intr_assist() > > > and also when the guest finishes handling the interrupt. Here > > finished is > > > defined > > > as the point when the guest re-enables interrupts or > lowers the > > tpr value. > > > EOI is not used as the end of interrupt as this is sometimes > > returned before > > > the interrupt handler has done its work. A flag is > passed to the > > handler > > > indicating > > > whether this is the injection point (post = 1) or the > interrupt > > finished (post > > > = 0) point. > > > The need for the finished point callback is discussed in the > > missed ticks > > > policy section. > > > > > > To prevent a possible early trigger of the finished callback, > > intr_en_notif > > > logic > > > has a two stage arm, the first at injection > > (hvm_intr_en_notif_arm()) and the > > > second when > > > interrupts are seen to be disabled > (hvm_intr_en_notif_disarm()). > > Once fully > > > armed, re-enabling > > > interrupts will cause hvm_intr_en_notif_disarm() to > make the end > > of interrupt > > > callback. hvm_intr_en_notif_arm() and > hvm_intr_en_notif_disarm() > > are called by > > > [vmx,svm]_intr_assist(). > > > > > > 3. Interrupt delivery policies > > > > > > The existing hpet interrupt delivery is preserved. > This includes > > > vcpu round robin delivery used by Linux and broadcast delivery > > used by > > > Windows. > > > > > > There are two policies for interrupt delivery, one for Windows > > 2k8-64 and the > > > other > > > for Linux. The Linux policy takes advantage of the > (guest) Linux > > missed tick > > > and offset > > > calculations and does not attempt to deliver the > right number of > > interrupts. > > > The Windows policy delivers the correct number of interrupts, > > even if > > > sometimes much > > > closer to each other than the period. The policies are similar > > to those in > > > vpt.c, though > > > there are some important differences. > > > > > > Policies are selected with an HVMOP_set_param > hypercall with index > > > HVM_PARAM_TIMER_MODE. > > > Two new values are added, > HVM_HPET_guest_computes_missed_ticks and > > > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that > > two new ones > > > are added is that > > > in some guests (32bit Linux) a no-missed policy is needed for > > clock sources > > > other than hpet > > > and a missed ticks policy for hpet. It was felt that > there would > > be less > > > confusion by simply > > > introducing the two hpet policies. > > > > > > 3.1. The missed ticks policy > > > > > > The Linux clock interrupt handler for hpet calculates missed > > ticks and offset > > > using the hpet > > > main counter. 
So the no-missed ticks policy was developed. In the no-missed ticks policy we deliver the correct number of interrupts, even if they are spaced less than a period apart (when catching up).

Windows 2k864 uses a broadcast mode in the interrupt routing such that all vcpus get the clock interrupt. The best Windows drift performance was achieved when the policy code ensured that all the previous interrupts (on the various vcpus) had been injected before injecting the next interrupt to the vioapic.

The policy code works as follows. It uses hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered with hvm_register_intr_en_notif(), at post=1 time it clears the current vcpu in the pending_mask. When the pending_mask is clear it decrements hpet.intr_pending_nr and, if intr_pending_nr is still non-zero, posts another interrupt to the vioapic with hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().

The missed ticks policy intr_en_notif callback also uses the pending_mask method, so even though Linux does not broadcast its interrupts, the code could handle it if it did. In this case the end of interrupt time stamp is made when the pending_mask is clear.
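To make that bookkeeping concrete, a minimal sketch of the two callbacks follows. The names pending_mask, intr_pending_nr, hvm_isa_irq_assert_cb(), hvm_register_intr_en_notif(), and hpet_route_decision_not_missed_ticks() come from the description above; the struct layout, the callback signatures, the omitted locking, and the hpet_assert_irq() helper are assumptions for illustration, not the patch itself.

    #include <stdint.h>

    struct hpet_policy_state {
        uint64_t pending_mask;        /* vcpus that have not yet taken the tick */
        unsigned int intr_pending_nr; /* clock interrupts still owed to the guest */
    };

    /* Hypothetical wrapper around hvm_isa_irq_assert_cb(). */
    void hpet_assert_irq(struct hpet_policy_state *h);

    /* Callback passed down to vioapic_deliver(): runs before any vcpu is
     * interrupted, with the mask of vcpus that will get the interrupt. */
    void hpet_deliver_cb(struct hpet_policy_state *h, uint64_t vcpu_mask)
    {
        h->pending_mask = vcpu_mask;
    }

    /* Callback registered with hvm_register_intr_en_notif();
     * post == 1 at injection, post == 0 when the guest finishes. */
    void hpet_intr_en_cb(struct hpet_policy_state *h, unsigned int vcpu, int post)
    {
        if ( post != 1 )
            return;
        h->pending_mask &= ~(1ULL << vcpu);    /* this vcpu now has its tick */
        if ( h->pending_mask == 0 && h->intr_pending_nr != 0 )
        {
            h->intr_pending_nr--;
            if ( h->intr_pending_nr != 0 )
                hpet_assert_irq(h);            /* deliver a catch-up tick */
        }
    }

hpet_route_decision_not_missed_ticks() would be the producer side of this counter, incrementing intr_pending_nr once per elapsed period.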
4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect to ntp. This is accomplished by migrating all of the state in the h->hpet data structure in the usual way. The hp->mc_offset is recalculated on the receiving node so that the guest sees a continuous hpet main counter (a sketch of this recalculation follows the sign-offs below).

Code has been added to xc_domain_save.c to send a small message after the domain context is sent. The message contains a physical tsc timestamp, last_tsc, read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has access to these time stamps. hpet_load uses the timestamps to account for the time spent saving and loading the domain context. With this technique, the only neglected time is the time spent sending a small network message.

5. Test Results

Some recent test results are:

5.1 Linux 4u664 and Windows 2k864 load test
Duration: 70 hours
Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux .0012%; Windows .009%

5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
Duration: 23 hours
Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux .033%; Windows .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter() in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter() itself. It uses no vcpu or domain state, as it is a physical entity in either mode. And it provides a real physical mode for every read for those applications that desire this.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity. Tests performed to date verify this, and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations on different platforms using the physical hpet device will be investigated. Additional overhead measurements on simulated vs physical hpet mode will be made.

Footnotes:

1. I don't recall the accuracy improvement with end of interrupt stamping, but it was significant, perhaps better than a two-to-one improvement. It would be a very simple matter to re-measure the improvement as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@virtualiron.com>
Signed-off-by: Ben Guthro <bguthro@virtualiron.com>
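To make the section 4 arithmetic concrete, a minimal sketch of the receive-side recalculation follows. The names mc_offset, last_tsc_sender, first_tsc_receiver, and read_64_main_counter() come from the text; the saved_main_counter field, the tsc_to_hpet_ticks() conversion, and the assumption that the two tsc stamps become comparable once scaled are illustrative only.

    #include <stdint.h>

    uint64_t read_64_main_counter(void);            /* from time.c */
    uint64_t tsc_to_hpet_ticks(uint64_t tsc_delta); /* hypothetical conversion */

    struct hpet_migration_state {
        uint64_t saved_main_counter;  /* guest-visible counter at save time */
        uint64_t last_tsc_sender;     /* read just before the message was sent */
        uint64_t first_tsc_receiver;  /* read when the message was received */
        uint64_t mc_offset;           /* guest counter = main counter + mc_offset */
    };

    /* Sketch of the fixup hpet_load would perform on the receiving node. */
    void hpet_load_fixup(struct hpet_migration_state *m)
    {
        /* hpet ticks that elapsed while the context was saved and loaded */
        uint64_t in_flight =
            tsc_to_hpet_ticks(m->first_tsc_receiver - m->last_tsc_sender);

        /* Choose mc_offset so the guest counter continues from its
         * pre-migration value plus the time spent migrating. */
        m->mc_offset = m->saved_main_counter + in_flight
                       - read_64_main_counter();
    }

With this choice, the only unaccounted interval is the network transit of the small last_tsc message, as the text notes.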
RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dave --

Thanks for the additional explanation.

Could you please be very precise, when you say "Linux", as to what you are (and are not) testing? Specifically:
1) kernel version number and/or distro info
2) 32 vs 64
3) kernel parameters specified
4) config file parameters
5) relevant CPU info that may be passed through by Xen to hvm guests (e.g. whether "tsc is synchronized")
6) relevant xen boot parameters (if any)

As we've seen, different combinations of the above can yield very different test results. We'd like to confirm your tests, but if we can avoid unnecessary additional iterations (due to mismatches on the above), that would be helpful.

Our testing goal is to ensure that there is at least one known good combination of parameters for each of RHEL3, RHEL4, and RHEL5 (both 32 and 64) that works on both tsc-synchronized and tsc-unsynchronized Intel and AMD boxes, and hopefully with and without a real physical hpet available.

We don't have a good test environment for Windows time, but if you can provide the same test configuration detail, we may be able to do some testing.

Thanks,
Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM,
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt delivery policies described in the write-up for the patch to get accurate timekeeping in the guest. The Windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also
> I wonder, for the skew you are seeing (in both hvm guests and
> domain0), is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Friday, June 06, 2008 1:33 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit guests configured for hpet with xen-unstable as is. As we have discussed many times, the ntp requirement is .05%. Tests on the patch we just submitted for hpet have indicated errors of .0012% on this platform under similar test conditions and .03% on other platforms.

Windows vista64 has an error of 11% using hpet with the xen-unstable bits. In an overnight test with our hpet patch, the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave
Dan Magenheimer wrote:

> Hi Dave and Ben --
>
> When running tests on xen-unstable (without your patch), please ensure
> that hpet=1 is set in the hvm config, and also I think that when hpet
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks,
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> all clock ticks must be delivered and so timer_mode should be 0).
>
> Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> it's my intent to clean this up, but I won't get to it until next week.
>
> Thanks,
> Dan

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Friday, June 06, 2008 4:46 AM
To: Keir Fraser; Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Keir,

I think the changes are required. We'll run some tests today so that we have some data to talk about.

-Dave

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it's for sure a lot less invasive than this patchset.

-- Keir
RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> A disadvantage of the simulated mode is that it can return the same value
> for the counter in consecutive calls.

It also occurs to me that if the granularity is good enough, an easy fix to this problem might be to always increment the returned value by at least one. Then time is always at least increasing rather than stopped.
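A sketch of that suggestion, layered on the read_64_main_counter() named in the patch write-up; the static variable and the absence of locking are simplifications for illustration, not a proposed implementation.

    #include <stdint.h>

    uint64_t read_64_main_counter(void);  /* simulated-mode read from time.c */

    /* Return a strictly increasing counter: if two reads land in the same
     * time-source granule, bump the second by one instead of repeating it. */
    uint64_t read_64_main_counter_strict(void)
    {
        static uint64_t last;   /* would live in vhpet state, under a lock */
        uint64_t now = read_64_main_counter();

        if ( now <= last )
            now = last + 1;
        last = now;
        return now;
    }

This leans on the granularity caveat above: the bump must be small enough relative to the counter resolution that repeated calls never run ahead of real time.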
Hi Dan,

> Could you please be very precise, when you say "Linux",
> as to what you are (and are not) testing? Specifically:
> 1) kernel version number and/or distro info

I personally have been testing with Linux red hat 4u4, 4u5 and 4u6 64 bit and red hat 4u4 32 bit. I've also tested Windows 2k8sp0 64bit and Vista sp1 64 bit. Our QA group is currently testing a wider set of guests.

> 2) 32 vs 64

Both.

> 3) kernel parameters specified

I'm pretty sloppy here. Frequently I have clock=pit because our build process defaults to that, and I know that the guests I use ignore clock= when hpet is in the acpi table. However, I don't recommend that others do this. I check that Linux is using hpet in its boot log, and I have statistics in the patch which tell me that hpet is being used. You've done a lot of investigation on clock= and clocksource, so I would like to hear your recommendation.

> 4) config file parameters

Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks for Windows 2k8-64 and Vista 64. 8 vcpus for Linux and 2 for Windows.

> 5) relevant CPU info that may be passed through by Xen
> to hvm guests (e.g. whether "tsc is synchronized")

Whatever xen-unstable does. I have not changed it.

> 6) relevant xen boot parameters (if any)

Nothing relevant.

> As we've seen, different combinations of the above can yield
> very different test results. We'd like to confirm your tests,
> but if we can avoid unnecessary additional iterations (due to
> mismatches on the above), that would be helpful.
>
> Our testing goal is to ensure that there is at least one
> known good combination of parameters for each of RHEL3,
> RHEL4, and RHEL5 (both 32 and 64) that works on
> both tsc-synchronized and tsc-unsynchronized Intel
> and AMD boxes. And hopefully that works with and without
> a real physical hpet available.

Thanks very much for testing this.

> We don't have a good test environment for Windows time,
> but if you can provide the same test configuration detail,
> we may be able to do some testing.

The configuration information was given above.

Thanks,
Dave
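For reference, the timer_mode settings above map to the new HVM_PARAM_TIMER_MODE values selected via the HVMOP_set_param hypercall described in the write-up. A toolstack-side sketch follows; the parameter index and the two policy names come from the patch, while the libxc call signature, the header locations, and the guest-type test are assumptions for illustration.

    #include <xenctrl.h>          /* xc_set_hvm_param(), assumed available */
    #include <xen/hvm/params.h>   /* HVM_PARAM_TIMER_MODE; the two HVM_HPET_*
                                     constants would be added by this patch */

    /* Select the vhpet interrupt delivery policy for a guest. */
    int set_hpet_timer_mode(int xc_handle, domid_t domid, int windows_guest)
    {
        unsigned long mode = windows_guest
            ? HVM_HPET_guest_does_not_compute_missed_ticks /* 2k8-64, Vista 64 */
            : HVM_HPET_guest_computes_missed_ticks;        /* Linux guests */

        return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE, mode);
    }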
Thanks, Dan -----Original Message----- From: Dave Winchell [mailto:dwinchell@virtualiron.com] Sent: Sunday, June 08, 2008 2:32 PM To: dan.magenheimer@oracle.com; Keir Fraser Cc: Ben Guthro; xen-devel; Dave Winchell Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy Hi Dan, > While I am fully supportive of offering hardware hpet as an option > for hvm guests (let''s call it hwhpet=1 for shorthand), I am very > surprised by your preliminary results; the most obvious conclusion > is that Xen system time is losing time at the rate of 1000 PPM > though its possible there''s a bug somewhere else in the "time > stack". Your Windows result is jaw-dropping and inexplicable, > though I have to admit ignorance of how Windows manages time. I think xen system time is fine. You have to add the interrupt delivery policies decribed in the write-up for the patch to get accurate timekeeping in the guest. The windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%. > I think with my recent patch and hpet=1 (essentially the same as > your emulated hpet), hvm guest time should track Xen system time. > I wonder if domain0 (which if I understand correctly is directly > using Xen system time) is also seeing an error of .1%? Also > I wonder for the skew you are seeing (in both hvm guests and > domain0) is time moving too fast or two slow? I don''t recall the direction. I can look it up in my notes at work tomorrow. > Although hwhpet=1 is a fine alternative in many cases, it may > be unavailable on some systems and may cause significant performance > issues on others. So I think we will still need to track down > the poor accuracy when hwhpet=0. Our patch is accurate to < .03% using the physical hpet mode or the simulated mode. > And if for some reason > Xen system time can''t be made accurate enough (< 0.05%), then > I think we should consider building Xen system time itself on > top of hardware hpet instead of TSC... at least when Xen discovers > a capable hpet. In our experience, Xen system time is accurate enough now. > One more thought... do you know the accuracy of the TSC crystals > on your test systems? I posted a patch awhile ago that was > intended to test that, though I guess it was only testing skew > of different TSCs on the same system, not TSCs against an > external time source. I do not know the tsc accuracy. > Or maybe there''s a computation error somewhere in the hvm hpet > scaling code? Hmmm... Regards, Dave -----Original Message----- From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com] Sent: Fri 6/6/2008 4:29 PM To: Dave Winchell; Keir Fraser Cc: Ben Guthro; xen-devel Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy Dave -- Thanks much for posting the preliminary results! While I am fully supportive of offering hardware hpet as an option for hvm guests (let''s call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM though its possible there''s a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time. I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which if I understand correctly is directly using Xen system time) is also seeing an error of .1%? 
Also I wonder for the skew you are seeing (in both hvm guests and domain0)
is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may be
unavailable on some systems and may cause significant performance issues on
others. So I think we will still need to track down the poor accuracy when
hwhpet=0. And if for some reason Xen system time can't be made accurate
enough (< 0.05%), then I think we should consider building Xen system time
itself on top of hardware hpet instead of TSC... at least when Xen
discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your
test systems? I posted a patch awhile ago that was intended to test that,
though I guess it was only testing skew of different TSCs on the same
system, not TSCs against an external time source.

Or maybe there's a computation error somewhere in the hvm hpet scaling
code? Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%. Tests on the patch
> we just submitted for hpet have indicated errors of .0012% on this
> platform under similar test conditions and .03% on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the xen-unstable
> bits. In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config and also I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient to
> > missed ticks so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so that
> > we have some data to talk about.
> >
> > -Dave
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather than host TSC?
> > Dan has reported much better time-keeping with his patch checked in,
> > and it's for sure a lot less invasive than this patchset.
> >
> > -- Keir
> >
> > On 5/6/08 15:59, "Ben Guthro" <bguthro@virtualiron.com> wrote:
> >
> > > [patch description snipped]
On 9/6/08 00:26, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > 4) config file parameters
>
> Hpet enabled. Timer_mode set to HVM_HPET_guest_computes_missed_ticks
> for all Linux guests and to HVM_HPET_guest_does_not_compute_missed_ticks
> for Windows 2k8-64 and Vista 64.
> 8 vcpus for Linux and 2 for Windows.

These new HVM_HPET options seem a weird design choice. It appears that you
can only set these or one of the old options, so there's not actually any
independence between the mode used by vpt.c and the mode used by hpet.c. At
guest install time you ought to be able to tell whether the guest will use
hpet or not based on its version (RHELx, SLESy, Winz etc etc) and decide
whether missed-ticks accounting is required or not.

I'd be more agreeable to a patch that stripped out the physical hpet
accesses (since you say they are not the reason for the improvement in
accuracy), built hpet on top of vpt, and added the necessary extra
mechanisms to deal with interrupt broadcasts into vpt.c. And which was
split into more separate pieces of mechanism.

-- Keir
> These new HVM_HPET options seem a weird design choice. It appears that
> you can only set these or one of the old options, so there's not actually
> any independence between the mode used by vpt.c and the mode used by
> hpet.c. At guest install time you ought to be able to tell whether the
> guest will use hpet or not based on its version (RHELx, SLESy, Winz etc
> etc) and decide whether missed-ticks accounting is required or not.

I'll use the existing options instead.

> I'd be more agreeable to a patch that stripped out the physical hpet
> accesses (since you say they are not the reason for the improvement in
> accuracy), built hpet on top of vpt, and added the necessary extra
> mechanisms to deal with interrupt broadcasts into vpt.c. And which was
> split into more separate pieces of mechanism.

OK, I'll work on this. How much time do I have to make the release you are
working on?

thanks,
Dave
On 9/6/08 12:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > I'd be more agreeable to a patch that stripped out the physical hpet
> > accesses (since you say they are not the reason for the improvement in
> > accuracy), built hpet on top of vpt, and added the necessary extra
> > mechanisms to deal with interrupt broadcasts into vpt.c. And which was
> > split into more separate pieces of mechanism.
>
> OK, I'll work on this. How much time do I have to make the release you
> are working on?

I'm thinking feature freeze at the end of the month.

-- Keir
On 9/6/08 13:03, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> > OK, I'll work on this. How much time do I have to make the release you
> > are working on?
>
> I'm thinking feature freeze at the end of the month.

Oh, while I'm commenting on the current version of the patch, I will point
out that changes to the save/restore format need careful thought. At the
very least we'd like to be backward compatible (old images restore on new
Xen) even if we don't achieve compatibility the other way round. I didn't
really look too closely at this aspect of the patches, so perhaps these
patches are fine in this regard.

Either way I think the core re-architecting of the hpet device model to
handle missed ticks and so on is independent of interface changes anyway
(apart from reasonable extensions to the timer_mode option). Changes to
save/restore format, addition of extra debugging code, and peripheral
things like that it'd be nice to have in separate patches. It makes the
core stuff easier to review and more likely to get accepted without quibble
(because there's less of it, and it is all dedicated to a single purpose
which I think we all agree is where we want to be).

Thanks,
Keir
Keir Fraser wrote:

> Oh, while I'm commenting on the current version of the patch, I will
> point out that changes to the save/restore format need careful thought.
> At the very least we'd like to be backward compatible (old images restore
> on new Xen) even if we don't achieve compatibility the other way round.
> I didn't really look too closely at this aspect of the patches, so
> perhaps these patches are fine in this regard.

I'll look into the live migrate from old to new.

> Either way I think the core re-architecting of the hpet device model
> to handle missed ticks and so on is independent of interface changes
> anyway (apart from reasonable extensions to the timer_mode option).
> Changes to save/restore format, addition of extra debugging code, and
> peripheral things like that it'd be nice to have in separate patches.
> It makes the core stuff easier to review and more likely to get
> accepted without quibble (because there's less of it, and it is all
> dedicated to a single purpose which I think we all agree is where we
> want to be).

This is fine. I'll break up the patch along the lines you suggest.

Thanks,
Dave
> At guest install time you ought to be able to tell whether the guest
> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> and decide whether missed-ticks accounting is required or not.

Unfortunately this is not true on Linux, at least without gathering (and
hardcoding) more information about the system. Whether hpet is used or not
is dependent not only on the OS/version and hvm config parameters, but also
on kernel command line parameters and even the underlying CPU. For example,
on RHEL5u1, if the tsc is synchronized and the CPU is Intel, and no kernel
parameters are chosen, tsc will be chosen as the default clocksource even
if hpet is present. Ugly.

That said, if Dave's patch achieves the stated accuracy on most versions of
Linux (e.g. at least RHEL4+5, 32+64, smp+1p) for SOME set of parameters
(which might be different on each Linux version), it would still be better
than what we have now.

The ideal solution, I think, would be for the default hvm settings to "do
the right thing" for both Windows and Linux at least for the vast majority
of configuration choices. I'm not sure this is possible, but it sure would
be nice.

Dan
I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
measurements. Is that just the accuracy of the underlying tsc on your test
system, e.g. the skew of tsc relative to an external (ntp) source? Or is
Xen (tsc-based) system time skewing that much on an overcommitted system
(and skewing much less than 0.03% on an unloaded system)?

Running the following on dom0 both on an unloaded and overcommitted system
(with ntpd off in dom0 and all guests) might be interesting:

# ntpdate $NTPSERVER; sleep 3600; ntpdate -q $NTPSERVER

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Saturday, June 07, 2008 3:21 PM
To: Keir Fraser; dan.magenheimer@oracle.com
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> Possibly there are bugs in the hpet device model which are fixed by
> Dave's patch. If this is actually the case, it would be nice to break
> those out as separate patches, as I think an 11% drift must largely be
> due to device-model bugs rather than relatively insignificant differences
> between hvm_get_guest_time() and physical HPET main counter.

Hi Keir,

I tried an experiment on Friday where I short circuited the missed ticks
policy code in the hpet.c patch, but used the physical hpet on each access.
The result for Linux was a drift of .1%, same as the xen-unstable bits.

Conversely, I get very good drift numbers, i.e., under .03%, when using the
missed ticks policy code and running in simulated mode (layered on stime)
when stime uses hpet.

So clearly, the improvement from .1% to .03% is due to the policy code. I
haven't run the short circuit test with the windows policy but I can do
that on Monday.

Note: For Windows and Linux I get < .03% drift using the policy code and
running simulated mode whether stime is using hpet or some other device.

regards,
Dave

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: Fri 6/6/2008 6:34 PM
To: dan.magenheimer@oracle.com; Dave Winchell
Cc: Ben Guthro; xen-devel
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

On 6/6/08 21:29, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0. And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

Yes, this would be a sensible extra timer_mode: have hvm_get_guest_time()
call to the platform time read function, and bypass TSC altogether. This
would be cleaner than having only the vHPET code punch through to the
physical HPET: instead we have the boot-time chosen platform timesource
used by all virtual timers.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Possibly there are bugs in the hpet device model which are fixed by Dave's
patch. If this is actually the case, it would be nice to break those out as
separate patches, as I think an 11% drift must largely be due to
device-model bugs rather than relatively insignificant differences between
hvm_get_guest_time() and physical HPET main counter.

-- Keir
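As a concrete illustration of what the ntpdate pair Dan suggests above
measures (a hypothetical helper, not part of the patch): the first ntpdate
zeroes dom0's offset against the server, and the second reports how far the
clock has drifted after an hour, so drift as a percentage is simply offset
over elapsed time:

#include <stdio.h>

/* Hypothetical helper: convert the offset reported by the second
 * ntpdate -q (seconds) and the sleep interval (seconds) into a drift
 * percentage comparable to the figures quoted in this thread. */
static double drift_percent(double offset_sec, double elapsed_sec)
{
    return offset_sec / elapsed_sec * 100.0;
}

int main(void)
{
    /* e.g. an offset of -1.2s after 3600s is about -.033% drift,
     * right at the edge of what ntp can discipline (.05%). */
    printf("%.4f%%\n", drift_percent(-1.2, 3600.0));
    return 0;
}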
On 9/6/08 21:48, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> > At guest install time you ought to be able to tell whether the guest
> > will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> > and decide whether missed-ticks accounting is required or not.
>
> Unfortunately this is not true on Linux, at least without gathering
> (and hardcoding) more information about the system. Whether hpet is
> used or not is dependent not only on the OS/version and hvm config
> parameters, but also on kernel command line parameters and even
> the underlying CPU. For example, on RHEL5u1, if the tsc is synchronized
> and the CPU is Intel, and no kernel parameters are chosen, tsc will be
> chosen as the default clocksource even if hpet is present. Ugly.

It's not immediately obvious that adding further independent configuration
knobs to twiddle would make our lives that much easier. However it
certainly increases the test matrix.

In your example above, by synchronised TSC do you mean constant-rate TSC?
That can at least be hidden in CPUID now.

-- Keir
Dan Magenheimer wrote:

> I'm wondering what is "magic" about 0.03% in all the non-hw-hpet
> measurements. Is that just the accuracy of the underlying tsc on your
> test system, e.g. the skew of tsc relative to an external (ntp) source?
> Or is Xen (tsc-based) system time skewing that much on an overcommitted
> system (and skewing much less than 0.03% on an unloaded system)?

.03% is simply the maximum error we've seen with hpet. The maximum value
(.03) is the same whether it's simulated or physical. The best value
physical is .001%, and I don't remember the best value simulated, but I
believe it is under .01%, perhaps well under. I'll have to repeat that
measurement.

I would think that simulated and physical would give roughly the same drift
values, but perhaps at very low drifts that doesn't hold. I think the .03%
is mostly due to the stability of the physical hpet device on a platform.
I've noticed on some platforms the simulated hpet time actually improves if
you disable the hpet in the bios so that stime() is layered on the pm timer
or whatever. I would like to get to the bottom of this hpet stability
variance from platform to platform.

Regards,
Dave
> >> At guest install time you ought to be able to tell whether the guest
> >> will use hpet or not based on its version (RHELx, SLESy, Winz etc etc)
> >> and decide whether missed-ticks accounting is required or not.
> >
> > Unfortunately this is not true on Linux, at least without gathering
> > (and hardcoding) more information about the system. Whether hpet is
> > used or not is dependent not only on the OS/version and hvm config
> > parameters, but also on kernel command line parameters and even
> > the underlying CPU. For example, on RHEL5u1, if the tsc is
> > synchronized and the CPU is Intel, and no kernel parameters are
> > chosen, tsc will be chosen as the default clocksource even if hpet is
> > present. Ugly.
>
> It's not immediately obvious that adding further independent
> configuration knobs to twiddle would make our lives that much easier.
> However it certainly increases the test matrix.

I fully agree. That's why I think the default parameters in Xen should "do
the right thing". The default will get the most testing, and if users say
"time hurts when I change the parameters" we can say "then don't change the
parameters" ;-)

> In your example above, by synchronised TSC do you mean constant-rate
> TSC? That can at least be hidden in CPUID now.

I meant synchronized as defined in 2.6.18/arch/x86_64/kernel/time.c in the
function unsynchronized_tsc() and as used in the same file in
time_init_gtod(). To make this more complicated, these routines have had
not-insignificant bug fixes in RHEL5u1/2.

But yes, it might be a good idea if X86_FEATURE_CONSTANT_TSC always returns
0 (or at least is configurable and defaults off), since the test may only
be made in the guest at boot time and the guest may migrate to a machine
without the feature. More ugliness, I know. My head hurts.

Dan
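A minimal sketch of the kind of CPUID filtering Keir alludes to. On AMD
CPUs the invariant/constant-rate TSC indication is the "TscInvariant" bit
(CPUID leaf 0x80000007, EDX bit 8); 2.6.18-era Intel kernels instead infer
X86_FEATURE_CONSTANT_TSC synthetically from family/model, so masking CPUID
alone would not cover them. The function name, the policy flag, and where
this hook would be plumbed in are assumptions, not part of the patch under
discussion:

#include <stdint.h>

#define TSC_INVARIANT_LEAF  0x80000007u
#define TSC_INVARIANT_BIT   (1u << 8)

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Hypothetical hook applied to CPUID results returned to an HVM guest:
 * clear the invariant-TSC bit so the guest never concludes that the
 * TSC rate is constant, even after migration to a different host. */
void filter_guest_cpuid(uint32_t leaf, struct cpuid_regs *regs,
                        int hide_constant_tsc)
{
    if (leaf == TSC_INVARIANT_LEAF && hide_constant_tsc)
        regs->edx &= ~TSC_INVARIANT_BIT;  /* guest sees non-constant TSC */
}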
> The Linux policy is more subtle, but is required to go
> from .1% to .03%.

Thanks for the good documentation, which I hadn't thoroughly read until
now. I now understand that the essence of your hpet missed ticks policy is
to ensure that ticks are never delivered too close together.

But I'm trying to understand WHY your patch works, in other words, what
problem it is countering. I care about this for more reasons than just
because it is interesting: (1) I'd like to feel confident that it is fixing
a bug rather than just a symptom of a bug; and (2) I wonder how universally
it is applicable.

I see from code examination in mark_offset_hpet() in
RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that the correction for lost
ticks is just plain wrong in a virtual environment. (Suppose for example
that a virtual tick was delivered every 1.999*hpet_tick... I think the
clock would be off by 50%!) Is this the bug that is being "countered" by
your policy?

However, the lost tick handling in RHEL5u1/kernel/timer.c (which I think is
used also for hpet) is much better, so I am eager to find out if your
policy works there too. If the hpet missed tick policy works for both,
though, I should be happy, though I wonder about upstream kernels (e.g. the
trend toward tickless). That said, I'd rather see this get into Xen 3.3 and
worry about upstream kernels later :-)
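A sketch of the arithmetic behind Dan's 1.999*hpet_tick example (a
simplified model of a truncating lost-tick correction, not the actual RHEL4
code): if the guest counts elapsed whole periods with integer division and
resets its reference stamp each interrupt, every interrupt arriving just
under two periods late is credited as a single tick, and nearly half of
real time is dropped:

#include <stdio.h>

/* Simplified model of a lost-tick correction that truncates: the
 * handler computes how many whole hpet periods elapsed since the last
 * interrupt and advances jiffies by that count, discarding the
 * fractional remainder. */
int main(void)
{
    const unsigned long hpet_tick = 14318; /* counter units per tick (illustrative) */
    unsigned long counter = 0, last = 0, jiffies = 0;
    double spacing = 1.999 * hpet_tick;    /* interrupts ~2 periods apart */

    for (int irq = 0; irq < 1000; irq++) {
        counter += (unsigned long)spacing;
        jiffies += (counter - last) / hpet_tick; /* truncates 1.999 -> 1 */
        last = counter;                          /* remainder is lost */
    }

    /* ~1000 ticks counted vs ~1999 ticks of real time: clock ~50% slow. */
    printf("jiffies=%lu, real ticks=%.0f\n",
           jiffies, 1000 * spacing / hpet_tick);
    return 0;
}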
> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.

> Thanks for the good documentation which I hadn't thoroughly
> read until now. I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together. But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem. On the other
hand, if the interrupt is late it in effect declares a tick plus
fraction. If it just declared the fraction in the first place,
we could deliver the interrupts whenever we wanted.

It's really not that different from the missed ticks policy in
vpt.c, except that the period in vpt.c is based on start of
interrupt, and I have improved on that with end of interrupt as
described in the patch note. I don't recall what prompted me to
try end-of-interrupt, but I saw a significant improvement. I may
have been running a monotonicity test at the same time, which
would explain the lock contention mentioned in the write-up.

> I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well on my small set of guests. You and our QA are
going to tell us about the wider set. It doesn't matter whether
guest A handles closely spaced interrupts, just whether it
handles them far apart. So it should be pretty universal with
guests that really handle missed ticks. I think it's interesting
that some 32-bit Linux guests handle missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!) Is this the bug that
> is being "countered" by your policy?

I haven't looked at that code; perhaps. I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better,
> so I am eager to find out if your policy works there too.
> If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave
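To make the failure mode Dave describes concrete, here is a
minimal sketch of a guest clock handler that credits at least one
full tick per interrupt. The names and the constant are
illustrative assumptions, not code from any actual kernel:

    #include <stdint.h>

    #define HPET_TICK 1000000ULL  /* main-counter units per guest tick (assumed) */

    static uint64_t last_counter; /* main counter at the previous interrupt */
    static uint64_t jiffies;      /* the guest's tick count */

    void guest_clock_interrupt(uint64_t counter)
    {
        uint64_t ticks = (counter - last_counter) / HPET_TICK;

        /*
         * A late interrupt credits the missed whole ticks, but an
         * early one (less than a period since the last) still credits
         * a full tick, so guest time runs fast.  The missed ticks
         * policy counters this by never delivering two interrupts
         * less than a period apart.
         */
        if (ticks == 0)
            ticks = 1;

        jiffies += ticks;
        last_counter = counter;
    }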
> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from the
EL4u5-32 code. If this is true, one would expect the default "no
missed tick" policy to see time moving faster than an external
source -- the first missed tick delivered after a long sleep
would "catch up" and then the remainder would each add another
tick.

> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the
> first place, we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded
and a new tick period commences at "now", so the fractions
eventually accumulate as lost time. In EL5u1-32, however, it
looks like the fractions are accounted for. Indeed the EL5u1-32
"lost tick handling" code resembles the Linux/ia64 code, which is
what I've always assumed was the "missed tick" model. In this
case, I think no policy is necessary and the measured skew should
be identical to any physical hpet skew. I'll have to test this
hypothesis though.
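A sketch of the two accounting styles Dan contrasts, with
illustrative names and an assumed tick constant (this is not the
RHEL source):

    #include <stdint.h>

    #define HPET_TICK 1000000ULL  /* main-counter units per guest tick (assumed) */

    static uint64_t last_counter;
    static uint64_t jiffies;

    /* EL4u5-style, as Dan reads it: the tick period restarts at
     * "now", so the fraction past the last whole tick is discarded
     * and slowly accumulates as lost time. */
    void tick_discarding_fraction(uint64_t counter)
    {
        jiffies += (counter - last_counter) / HPET_TICK;
        last_counter = counter;            /* remainder thrown away */
    }

    /* EL5u1/ia64-style: only whole ticks are consumed and the
     * remainder carries over, so no time is lost however the
     * interrupts arrive. */
    void tick_keeping_fraction(uint64_t counter)
    {
        uint64_t ticks = (counter - last_counter) / HPET_TICK;
        jiffies += ticks;
        last_counter += ticks * HPET_TICK; /* remainder carries over */
    }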
On 10/6/08 00:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.

Doesn't this policy guarantee that you actually deliver interrupts at
consistently too low a rate? Since the delivery period is now timer
period + latency of interrupt handling? I suppose it works for this
guest type because it doesn't actually care about getting interrupts
at the correct rate, so long as the ticks are always a bit late? For
those that do need missed ticks to be delivered, do you track missed
ticks at the absolute correct rate?

This is perhaps a fine tradeoff for all platform timers -- those
guests that can handle missed ticks obviously do not care about
getting their timer interrupts at absolutely the correct rate, and
delivering a little late is what they are geared to handle (getting
delivered consistently early is just weird!). Whereas guests that
need all ticks also want them (at least over the long run) at exactly
the correct rate.

I think there's good empirical analysis in the work you've done. We
just need the patches cleaned up and generalised for vpt.c now.

 -- Keir
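One way to read the arithmetic behind Keir's question, with assumed
numbers rather than measurements:

    #include <stdio.h>

    int main(void)
    {
        double period_ms  = 10.0; /* nominal tick period (assumed) */
        double latency_ms = 0.5;  /* guest handling latency (assumed) */

        /* If each tick is gated on the previous end-of-interrupt,
         * the delivery period stretches to period + latency, so the
         * achieved rate falls below nominal unless the guest computes
         * missed ticks from the main counter itself. */
        printf("achieved rate: %.1f%% of nominal\n",
               100.0 * period_ms / (period_ms + latency_ms));
        return 0;
    }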
> > I don't recall what prompted me to try end-of-interrupt,
> > but I saw a significant improvement. I may have been running
> > a monotonicity test at the same time to explain the lock
> > contention mentioned in the write-up.

> Doesn't this policy guarantee that you actually deliver interrupts
> at consistently too low a rate? Since the delivery period is now
> timer period + latency of interrupt handling?

Yes, the rate ends up being about half the normal rate because I toss
the interrupt if it doesn't meet the requirement. If, in testing, we
find a guest that has a problem with half the normal rate, we can
fine tune the policy. For example, instead of discarding, set a small
timer.

> I suppose it works for this guest type because it doesn't actually
> care about getting interrupts at the correct rate, so long as the
> ticks are always a bit late?

True.

> For those that do need missed ticks to be delivered, do you track
> missed ticks at the absolute correct rate?

Yes.

> This is perhaps a fine tradeoff for all platform timers -- those
> guests that can handle missed ticks obviously do not care about
> getting their timer interrupts at absolutely the correct rate, and
> delivering a little late is what they are geared to handle (getting
> delivered consistently early is just weird!). Whereas guests that
> need all ticks also want them (at least over the long run) at
> exactly the correct rate.

> I think there's good empirical analysis in the work you've done. We
> just need the patches cleaned up and generalised for vpt.c now.

Thanks. I'll get to work on the generalization, etc.

-Dave
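A sketch of the gate Dave describes and why it tosses roughly every
other tick. hvm_register_intr_en_notif() and read_64_main_counter()
are named in the patch note; everything else here is a stand-in:

    #include <stdint.h>

    #define PERIOD 1000000ULL     /* tick period in main-counter units (assumed) */

    static uint64_t last_handled; /* main counter at end-of-interrupt */

    extern uint64_t read_main_counter(void); /* stand-in for read_64_main_counter() */
    extern void deliver_to_vioapic(void);    /* stand-in for the delivery path */

    /* Called at each periodic timer expiry. */
    void tick_timer_expired(void)
    {
        uint64_t now = read_main_counter();

        if (now - last_handled < PERIOD)
            return;               /* too soon after the last handled tick: toss */
        deliver_to_vioapic();     /* the guest computes missed ticks itself */
    }

    /* Registered via hvm_register_intr_en_notif(); runs when the
     * guest re-enables interrupts (post = 0). */
    void intr_finished_callback(void)
    {
        /* last_handled lags delivery by the handling latency, so the
         * very next expiry lands inside the window and is tossed --
         * hence roughly every other tick, about half the nominal
         * rate. */
        last_handled = read_main_counter();
    }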
> > Doesn't this policy guarantee that you actually deliver interrupts at a
> > consistently too-low rate? Since the delivery period is now timer period +
> > latency of interrupt handling?
>
> Yes, the rate ends up being about half the normal rate because
> I toss the interrupt if it doesn't meet the requirement. If, in testing,
> we find a guest that has a problem with half the normal rate, we can
> fine tune the policy. For example, instead of discarding, set a small timer.

A better solution is to set the new period timer at end-of-interrupt time
(in the callback) instead of delivery time (timer expiration time).
This way the rate will be very close to what the guest expects.
I think I'll make this change.

-Dave
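A sketch of that re-arm, under the same caveat that the names and
signatures are illustrative:

#include <stdint.h>

/* Hypothetical timer and clock primitives. */
extern void set_timer_abs(uint64_t expires_ns); /* one-shot, absolute */
extern uint64_t now_ns(void);

static uint64_t period_ns;

/* Called from the intr_en_notif callback.  post == 1 is the injection
 * point; post == 0 means the guest has finished handling the interrupt.
 * Arming the next tick here spaces deliveries one period after each
 * *handled* tick, so ticks no longer bunch up against the
 * minimum-spacing test and get tossed. */
static void hpet_tick_notif(int post)
{
    if ( post == 0 )
        set_timer_abs(now_ns() + period_ns);
}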
On 10/6/08 13:34, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> > Yes, the rate ends up being about half the normal rate because
> > I toss the interrupt if it doesn't meet the requirement. If, in testing,
> > we find a guest that has a problem with half the normal rate, we can
> > fine tune the policy. For example, instead of discarding, set a small timer.
>
> A better solution is to set the new period timer at end-of-interrupt time
> (in the callback) instead of delivery time (timer expiration time).
> This way the rate will be very close to what the guest expects.
> I think I'll make this change.

Yes, that's what I thought you'd done. It sounds nicer to me.

 -- Keir
Keir, Dan:

Although I plan to break up the patch, etc., I'm posting this fix to the
patch for anyone who might be interested.

thanks,
Dave
> In EL5u1-32 however it looks like the fractions are accounted
> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code which is what I've always assumed was
> the "missed tick" model. In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew. I'll have to test this hypothesis though.

I've tested this hypothesis and it seems to hold true.
This means the existing (unpatched) hpet code works fine
on EL5-32bit (vcpus=1) when hpet is the clocksource,
even when the machine is overcommitted. A second hypothesis
still needs to be tested: that Dave's patch will not make this worse.

(Note that per previous discussion, my EL5u1-32bit guest
running on an Intel dual-core physical box chose tsc as
the best clocksource and I had to override it with
clock=hpet in the kernel command line.)

> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code. If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.

Indeed with the existing (unpatched) hpet code, time is
running faster on EL4u5-32 (vcpus=1, when overcommitted).
So Dave's patch is definitely needed here.

Will try 64-bit next.

Dan

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
Sent: Monday, June 09, 2008 9:21 PM
To: 'Dave Winchell'; 'Keir Fraser'
Cc: 'xen-devel'; 'Ben Guthro'
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from
the EL4u5-32 code. If this is true, one would expect the
default "no missed tick" policy to see time moving faster
than an external source -- the first missed tick delivered
after a long sleep would "catch up" and then the remainder
would each add another tick.

> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the first place,
> we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded
and a new tick period commences at "now", so the fractions
eventually accumulate as lost time.

In EL5u1-32 however it looks like the fractions are accounted
for. Indeed the EL5u1-32 "lost tick handling" code resembles
the Linux/ia64 code which is what I've always assumed was
the "missed tick" model. In this case, I think no policy
is necessary and the measured skew should be identical to
any physical hpet skew. I'll have to test this hypothesis though.
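To pin down the two accounting styles contrasted above, here is a toy
model (not the actual guest code) of how each style credits time per
clock interrupt, where delta is the hpet main-counter advance since the
previous interrupt:

#include <stdint.h>

static uint64_t jiffies_a, jiffies_b, leftover_b;

/* EL4u5-style, as described above: every interrupt credits at least one
 * full tick and the sub-tick remainder is discarded.  Early interrupts
 * therefore inflate time, and late interrupts leak the fractions. */
static void account_discard_fraction(uint64_t delta, uint64_t tick)
{
    uint64_t n = delta / tick;
    jiffies_a += (n == 0) ? 1 : n;
}

/* EL5u1/ia64-style: carry the fraction forward, so arrival jitter
 * cancels out over the long run and no delivery policy is needed. */
static void account_carry_fraction(uint64_t delta, uint64_t tick)
{
    leftover_b += delta;
    jiffies_b  += leftover_b / tick;
    leftover_b %= tick;
}

Worked through the first model, a tick arriving every 1.999*hpet_tick
credits one jiffy per 1.999 real ticks, so the clock runs at roughly half
speed -- the mark_offset_hpet failure mode discussed in the appended
exchange below.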
-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Monday, June 09, 2008 5:35 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Dave Winchell; xen-devel; Ben Guthro
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

> > The Linux policy is more subtle, but is required to go
> > from .1% to .03%.
>
> Thanks for the good documentation which I hadn't thoroughly
> read until now. I now understand that the essence of your
> hpet missed ticks policy is to ensure that ticks are never
> delivered too close together. But I'm trying to understand
> WHY your patch works, in other words, what problem it is
> countering.

I'll tell you what I recall about this. Tomorrow I'll check the
guest code to verify. I think that Linux declares a full tick,
even if the interrupt is early. That's the problem.
On the other hand, if the interrupt is late it in effect declares
a tick plus fraction. If it just declared the fraction in the
first place, we could deliver the interrupts whenever we wanted.

It's really not that different than the missed ticks policy in vpt.c,
except that there the period is based on start of interrupt,
and I have improved that with end-of-interrupt as described
in the patch note.

I don't recall what prompted me to try end-of-interrupt,
but I saw a significant improvement. I may have been running
a monotonicity test at the same time to explain the lock
contention mentioned in the write-up.

> I care about this for more reasons than just
> because it is interesting: (1) I'd like to feel confident that
> it is fixing a bug rather than just a symptom of a bug;
> and (2) I wonder how universally it is applicable.

It's worked well with my small set of guests. You and our
QA are going to tell us about the wider set. It doesn't
matter if guest A handles interrupts closely spaced or not,
just whether it handles them far apart. So it should be pretty
universal with guests that really handle missed ticks.
I think it's interesting that some 32bit Linux guests handle
missed ticks for hpet.

> I see from code examination in mark_offset_hpet() in
> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> the correction for lost ticks is just plain wrong in
> a virtual environment. (Suppose for example that a virtual
> tick was delivered every 1.999*hpet_tick... I think
> the clock would be off by 50%!) Is this the bug that
> is being "countered" by your policy?

I haven't looked at that code; perhaps.
I'll check it tomorrow.

> However, the lost tick handling in RHEL5u1/kernel/timer.c
> (which I think is used also for hpet) is much better
> so I am eager to find out if your policy works there
> too. If the hpet missed tick policy works for both, though,
> I should be happy, though I wonder about upstream kernels
> (e.g. the trend toward tickless).

I wasn't aware of this trend. If it's robust, however, it should
handle late interrupts ...

> That said, I'd rather
> see this get into Xen 3.3 and worry about upstream kernels
> later :-)

Regards,
Dave
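Since the start-of-interrupt vs end-of-interrupt distinction is the crux
of the improvement, a compressed sketch of the stamping side may help
(illustrative signature; the real hook is the intr_en_notif callback
described in the patch note):

#include <stdint.h>

static uint64_t last_handled_mc; /* the stamp consulted at delivery time */

/* Registered for the hpet vector via hvm_register_intr_en_notif().
 * post == 1: the vector was just injected; post == 0: the guest has
 * re-enabled interrupts, i.e. the handler is done.  Stamping at
 * post == 0 means time the handler spends waiting on the guest's clock
 * spinlock cannot squeeze two main-counter reads (taken under that
 * lock) into less than one period apart. */
static void hpet_intr_en_notif(int post, uint64_t now_mc)
{
    if ( post == 0 )
        last_handled_mc = now_mc;
}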
-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though its possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also
> I wonder, for the skew you are seeing (in both hvm guests and
> domain0), is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave
-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Friday, June 06, 2008 1:33 PM
To: dan.magenheimer@oracle.com; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit
guests configured for hpet with xen-unstable as is. As we have
discussed many times, the ntp requirement is .05%.
Tests on the patch we just submitted for hpet have indicated errors
of .0012% on this platform under similar test conditions, and .03% on
other platforms.

Windows vista64 has an error of 11% using hpet with the
xen-unstable bits. In an overnight test with our hpet patch,
the Windows vista error was .008%.

The tests are with two or three guests on a physical node, all under
load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.

thanks,
Dave
Dan Magenheimer wrote:

> Hi Dave and Ben --
>
> When running tests on xen-unstable (without your patch), please ensure
> that hpet=1 is set in the hvm config. Also, I think that when hpet
> is the clocksource on RHEL4-32, the clock IS resilient to missed ticks,
> so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> all clock ticks must be delivered and so timer_mode should be 0).
>
> Per
> http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> it's my intent to clean this up, but I won't get to it until next week.
>
> Thanks,
> Dan

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave Winchell
Sent: Friday, June 06, 2008 4:46 AM
To: Keir Fraser; Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Keir,

I think the changes are required. We'll run some tests today so
that we have some data to talk about.

-Dave

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@oracle.com
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Are these patches needed now the timers are built on Xen system
time rather than host TSC? Dan has reported much better
time-keeping with his patch checked in, and it's for sure a lot
less invasive than this patchset.

 -- Keir
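Collecting Dan's settings above into a guest config fragment (a sketch;
exact syntax depends on the toolstack version in use):

# HVM guest config for the comparison runs (illustrative):
hpet = 1            # expose the virtual hpet to the guest

# RHEL4-32 with hpet as clocksource: resilient to missed ticks.
timer_mode = 2

# RHEL4-32 with pit as clocksource: every tick must be delivered.
# timer_mode = 0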
I implemented the monotonicity guarantee within hvm_get_guest_time(). We
don't need or want get_s_time_mono().

 -- Keir

On 10/6/08 18:13, "Dave Winchell" <dwinchell@virtualiron.com> wrote:

> Keir, Dan:
>
> Although I plan to break up the patch, etc., I'm posting
> this fix to the patch for anyone who might be interested.
>
> thanks,
> Dave
>
> # This is a BitKeeper generated diff -Nru style patch.
> #
> # ChangeSet
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com
> #   vi-patch: xen-hpet
> #
> #   Bug Id: 6057
> #
> #   Reviewed by: Robert
> #
> #   SUMMARY: Fix wrap issue in monotonic s_time().
> #
> # xen/arch/x86/time.c
> #   2008/06/10 12:20:48-04:00 winchell@dwinchell2.virtualiron.com +3 -2
> #   Fix wrap issue in monotonic s_time().
> #
> diff -Nru a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> --- a/xen/arch/x86/time.c  2008-06-10 13:08:39 -04:00
> +++ b/xen/arch/x86/time.c  2008-06-10 13:08:39 -04:00
> @@ -534,7 +534,7 @@
>      u64 count;
>      unsigned long flags;
>      struct cpu_time *t = &this_cpu(cpu_time);
> -    u64 tsc, delta;
> +    u64 tsc, delta, diff;
>      s_time_t now;
>
>      if(hpet_main_counter_phys_avoid_hdw || !hpet_physical_inited) {
> @@ -542,7 +542,8 @@
>      rdtscll(tsc);
>      delta = tsc - t->local_tsc_stamp;
>      now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
> -    if(now > get_s_time_mon.last_ret)
> +    diff = (u64)now - (u64)get_s_time_mon.last_ret;
> +    if((s64)diff > (s64)0)
>          get_s_time_mon.last_ret = now;
>      else
>          now = get_s_time_mon.last_ret;
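The one-line change deserves a note: a direct now > last comparison
misorders timestamps once the arithmetic wraps, whereas casting the
unsigned difference to signed stays correct for any forward gap under
2^63. A standalone illustration (not the patch code itself):

#include <stdint.h>

/* Wrap-safe "is a later than b?" for free-running 64-bit timestamps.
 * Unsigned subtraction wraps modulo 2^64, and reading the result as
 * signed treats any forward gap below 2^63 as positive. */
static int time_is_after(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) > 0;
}

This is the same idiom as the Linux kernel's time_after() family of
macros.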
> I implemented the monotonicity guarantee within hvm_get_guest_time(). We
> don't need or want get_s_time_mono().

I'll give hvm_get_guest_time() another look.

-Dave
Dan Magenheimer wrote:

> > In EL5u1-32 however it looks like the fractions are accounted
> > for. Indeed the EL5u1-32 "lost tick handling" code resembles
> > the Linux/ia64 code which is what I've always assumed was
> > the "missed tick" model. In this case, I think no policy
> > is necessary and the measured skew should be identical to
> > any physical hpet skew. I'll have to test this hypothesis though.
>
> I've tested this hypothesis and it seems to hold true.
> This means the existing (unpatched) hpet code works fine
> on EL5-32bit (vcpus=1) when hpet is the clocksource,
> even when the machine is overcommitted. A second hypothesis
> still needs to be tested: that Dave's patch will not make this worse.

Interesting, thanks for pointing this out and confirming.

> (Note that per previous discussion, my EL5u1-32bit guest
> running on an Intel dual-core physical box chose tsc as
> the best clocksource and I had to override it with
> clock=hpet in the kernel command line.)

Is there one setting for all Linux guests that makes them choose hpet?
Is it "clock=hpet clocksource=hpet"? I know you wrote at length about
this before.

> > Yes, that makes sense and concurs with what I remember from
> > the EL4u5-32 code. If this is true, one would expect the
> > default "no missed tick" policy to see time moving faster
> > than an external source -- the first missed tick delivered
> > after a long sleep would "catch up" and then the remainder
> > would each add another tick.
>
> Indeed with the existing (unpatched) hpet code, time is
> running faster on EL4u5-32 (vcpus=1, when overcommitted).
> So Dave's patch is definitely needed here.

It's good to get the verification of this.

thanks,
Dave

> Will try 64-bit next.
>
> Dan
>>>> > When the pending_mask is clear it decrements >>>> hpet.intr_pending_nr and if >>>> > intr_pending_nr is still >>>> > non-zero posts another interrupt to the ioapic with >>>> hvm_isa_irq_assert_cb(). >>>> > Intr_pending_nr is incremented in >>>> hpet_route_decision_not_missed_ticks(). >>>> > >>>> > The missed ticks policy intr_en_notif callback also uses the >>>> pending_mask >>>> > method. So even though >>>> > Linux does not broadcast its interrupts, the code >>>> >>>> >>could handle >> >> >>>> it if it did. >>>> > In this case the end of interrupt time stamp is >>>> >>>> >>made when the >> >> >>>> pending_mask is >>>> > clear. >>>> > >>>> > 4. Live Migration >>>> > >>>> > Live migration with hpet preserves the current offset of the >>>> guest clock with >>>> > respect >>>> > to ntp. This is accomplished by migrating all of >>>> >>>> >>the state in >> >> >>>> the h->hpet data >>>> > structure >>>> > in the usual way. The hp->mc_offset is recalculated on the >>>> receiving node so >>>> > that the >>>> > guest sees a continuous hpet main counter. >>>> > >>>> > Code as been added to xc_domain_save.c to send a >>>> >>>> >>small message >> >> >>>> after the >>>> > domain context is sent. The contents of the message is the >>>> physical tsc >>>> > timestamp, last_tsc, >>>> > read just before the message is sent. When the >>>> >>>> >>>last_tsc message >>> >>> >>>> is received in >>>> > xc_domain_restore.c, >>>> > another physical tsc timestamp, cur_tsc, is read. The two >>>> timestamps are >>>> > loaded into the domain >>>> > structure as last_tsc_sender and first_tsc_receiver with >>>> hypercalls. Then >>>> > xc_domain_hvm_setcontext >>>> > is called so that hpet_load has access to these time stamps. >>>> Hpet_load uses >>>> > the timestamps >>>> > to account for the time spent saving and loading the domain >>>> context. With this >>>> > technique, >>>> > the only neglected time is the time spent sending a small >>>> network message. >>>> > >>>> > 5. Test Results >>>> > >>>> > Some recent test results are: >>>> > >>>> > 5.1 Linux 4u664 and Windows 2k864 load test. >>>> > Duration: 70 hours. >>>> > Test date: 6/2/08 >>>> > Loads: usex -b48 on Linux; burn-in on Windows >>>> > Guest vcpus: 8 for Linux; 2 for Windows >>>> > Hardware: 8 physical cpu AMD >>>> > Clock drift : Linux: .0012% Windows: .009% >>>> > >>>> > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 >>>> >>>> >>no-load test >> >> >>>> > Duration: 23 hours. >>>> > Test date: 6/3/08 >>>> > Loads: none >>>> > Guest vcpus: 8 for each Linux; 2 for Windows >>>> > Hardware: 4 physical cpu AMD >>>> > Clock drift : Linux: .033% Windows: .019% >>>> > >>>> > 6. Relation to recent work in xen-unstable >>>> > >>>> > There is a similarity between hvm_get_guest_time() in >>>> xen-unstable and >>>> > read_64_main_counter() >>>> > in this code. However, read_64_main_counter() is >>>> >>>> >>more tuned to >> >> >>>> the needs of >>>> > hpet.c. It has no >>>> > "set" operation, only the get. It isolates the mode, >>>> >>>> >>>physical or >>> >>> >>>> simulated, in >>>> > read_64_main_counter() >>>> > itself. It uses no vcpu or domain state as it is a physical >>>> entity, in either >>>> > mode. And it provides a real >>>> > physical mode for every read for those applications >>>> >>>> >>>that desire >>> >>> >>>> this. >>>> > >>>> > 7. Conclusion >>>> > >>>> > The virtual hpet is improved by this patch in terms >>>> >>>> >>>of accuracy and >>> >>> >>>> > monotonicity. 
>>>> > Tests performed to date verify this and more testing >>>> >>>> >>>is under way. >>> >>> >>>> > >>>> > 8. Future Work >>>> > >>>> > Testing with Windows Vista will be performed soon. >>>> >>>> >>The reason >> >> >>>> for accuracy >>>> > variations >>>> > on different platforms using the physical hpet >>>> >>>> >>device will be >> >> >>>> investigated. >>>> > Additional overhead measurements on simulated vs >>>> >>>> >>physical hpet >> >> >>>> mode will be >>>> > made. >>>> > >>>> > Footnotes: >>>> > >>>> > 1. I don''t recall the accuracy improvement with end >>>> >>>> >>>of interrupt >>> >>> >>>> stamping, but >>>> > it was >>>> > significant, perhaps better than two to one improvement. It >>>> would be a very >>>> > simple matter >>>> > to re-measure the improvement as the facility can >>>> >>>> >>call back at >> >> >>>> injection time >>>> > as well. >>>> > >>>> > >>>> > Signed-off-by: Dave Winchell <dwinchell@virtualiron.com> >>>> > <mailto:dwinchell@virtualiron.com> >>>> > Signed-off-by: Ben Guthro <bguthro@virtualiron.com> >>>> > <mailto:bguthro@virtualiron.com> >>>> > >>>> > >>>> > _______________________________________________ >>>> > Xen-devel mailing list >>>> > Xen-devel@lists.xensource.com >>>> > http://lists.xensource.com/xen-devel >>>> >>>> >>>> >>>> >>>> >>> >>> > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
OK, I can confirm that without Dave's patch RHEL4- and RHEL5-based
64-bit uniprocessor kernels gain time when hpet is the clocksource.
But WHOA: with vcpus=2, el5u1-32 time suddenly goes crazy when domains
are added, whereas it seems fine when vcpus=1. All my testing so far
has been on 3.1.3, so I am going to redo it on xen-unstable, first
without Dave's patch, then with.

> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.

In the hvm configuration file:

    hpet=1
    acpi=1

(Note, acpi unspecified works too, as 1 appears to be the default;
but hpet=1 is ignored if acpi=0.)

In the kernel command line of the hvm domain (e.g. in grub.conf):

    clock=hpet notsc nopmtimer

(Note, a different set of kernel parameters is necessary for each
kernel, but because the kernel either ignores or gives harmless
warnings for invalid parameters, this set should always result in
hpet being selected as the clocksource, at least on all RHEL4- and
RHEL5-based kernels I've tested.)

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Wednesday, June 11, 2008 7:58 AM
> To: dan.magenheimer@oracle.com
> Cc: Keir Fraser; xen-devel; Ben Guthro; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan Magenheimer wrote:
>
> >> In EL5u1-32 however it looks like the fractions are accounted
> >> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> >> the Linux/ia64 code, which is what I've always assumed was
> >> the "missed tick" model. In this case, I think no policy
> >> is necessary and the measured skew should be identical to
> >> any physical hpet skew. I'll have to test this hypothesis though.
> >
> > I've tested this hypothesis and it seems to hold true.
> > This means the existing (unpatched) hpet code works fine
> > on EL5-32bit (vcpus=1) when hpet is the clocksource,
> > even when the machine is overcommitted. A second hypothesis
> > still needs to be tested: that Dave's patch will not make this worse.
>
> Interesting, thanks for pointing this out and confirming.
>
> > (Note that per previous discussion, my EL5u1-32bit guest
> > running on an Intel dual-core physical box chose tsc as
> > the best clocksource and I had to override it with
> > clock=hpet in the kernel command line.)
>
> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.
>
> >> Yes, that makes sense and concurs with what I remember from
> >> the EL4u5-32 code. If this is true, one would expect the
> >> default "no missed tick" policy to see time moving faster
> >> than an external source -- the first missed tick delivered
> >> after a long sleep would "catch up" and then the remainder
> >> would each add another tick.
> >
> > Indeed with the existing (unpatched) hpet code, time is
> > running faster on EL4u5-32 (vcpus=1, when overcommitted).
> > So Dave's patch is definitely needed here.
>
> It's good to get the verification of this.
>
> thanks,
> Dave
>
> > Will try 64-bit next.
> >
> > Dan
>
> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Monday, June 09, 2008 9:21 PM
> To: 'Dave Winchell'; 'Keir Fraser'
> Cc: 'xen-devel'; 'Ben Guthro'
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> >> I'll tell you what I recall about this. Tomorrow I'll check the
> >> guest code to verify. I think that Linux declares a full tick,
> >> even if the interrupt is early. That's the problem.
>
> Yes, that makes sense and concurs with what I remember from
> the EL4u5-32 code. If this is true, one would expect the
> default "no missed tick" policy to see time moving faster
> than an external source -- the first missed tick delivered
> after a long sleep would "catch up" and then the remainder
> would each add another tick.
>
> >> On the other hand, if the interrupt is late it in effect declares
> >> a tick plus fraction. If it just declared the fraction in the
> >> first place, we could deliver the interrupts whenever we wanted.
>
> My read of the EL4u5-32 code is that the fraction is discarded
> and a new tick period commences at "now", so the fractions
> eventually accumulate as lost time.
>
> In EL5u1-32 however it looks like the fractions are accounted
> for. Indeed the EL5u1-32 "lost tick handling" code resembles
> the Linux/ia64 code, which is what I've always assumed was
> the "missed tick" model. In this case, I think no policy
> is necessary and the measured skew should be identical to
> any physical hpet skew. I'll have to test this hypothesis though.
>
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Dave Winchell
> Sent: Monday, June 09, 2008 5:35 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Dave Winchell; xen-devel; Ben Guthro
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> >>> The Linux policy is more subtle, but is required to go
> >>> from .1% to .03%.
> >>
> >> Thanks for the good documentation which I hadn't thoroughly
> >> read until now. I now understand that the essence of your
> >> hpet missed ticks policy is to ensure that ticks are never
> >> delivered too close together. But I'm trying to understand
> >> WHY your patch works, in other words, what problem it is
> >> countering.
>
> I'll tell you what I recall about this. Tomorrow I'll check the
> guest code to verify. I think that Linux declares a full tick,
> even if the interrupt is early. That's the problem.
> On the other hand, if the interrupt is late it in effect declares
> a tick plus fraction. If it just declared the fraction in the
> first place, we could deliver the interrupts whenever we wanted.
>
> It's really not that different than the missed ticks policy in
> vpt.c, except that there the period in vpt.c is based on start of
> interrupt, and I have improved that with end-of-interrupt as
> described in the patch note.
>
> I don't recall what prompted me to try end-of-interrupt,
> but I saw a significant improvement. I may have been running
> a monotonicity test at the same time to explain the lock
> contention mentioned in the write-up.
>
> >> I care about this for more reasons than just
> >> because it is interesting: (1) I'd like to feel confident that
> >> it is fixing a bug rather than just a symptom of a bug;
> >> and (2) I wonder how universally it is applicable.
>
> It's worked well on my small set of guests. You and our
> QA are going to tell us about the wider set. It doesn't
> matter if guest A handles interrupts closely spaced or not,
> just whether it handles them far apart. So it should be pretty
> universal with guests that really handle missed ticks.
> I think it's interesting that some 32bit Linux guests handle
> missed ticks for hpet.
>
> >> I see from code examination in mark_offset_hpet() in
> >> RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >> the correction for lost ticks is just plain wrong in
> >> a virtual environment. (Suppose for example that a virtual
> >> tick was delivered every 1.999*hpet_tick... I think
> >> the clock would be off by 50%!) Is this the bug that
> >> is being "countered" by your policy?
>
> I haven't looked at that code, perhaps.
> I'll check it tomorrow.
>
> >> However, the lost tick handling in RHEL5u1/kernel/timer.c
> >> (which I think is used also for hpet) is much better,
> >> so I am eager to find out if your policy works there too.
> >> If the hpet missed tick policy works for both, though,
> >> I should be happy, though I wonder about upstream kernels
> >> (e.g. the trend toward tickless).
>
> I wasn't aware of this trend. If it's robust, however, it should
> handle late interrupts ...
>
> >> That said, I'd rather
> >> see this get into Xen 3.3 and worry about upstream kernels
> >> later :-)
>
> Regards,
> Dave
>
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Sunday, June 08, 2008 2:32 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Hi Dan,
>
> >> While I am fully supportive of offering hardware hpet as an option
> >> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >> surprised by your preliminary results; the most obvious conclusion
> >> is that Xen system time is losing time at the rate of 1000 PPM,
> >> though it's possible there's a bug somewhere else in the "time
> >> stack". Your Windows result is jaw-dropping and inexplicable,
> >> though I have to admit ignorance of how Windows manages time.
>
> I think xen system time is fine. You have to add the interrupt
> delivery policies described in the write-up for the patch to get
> accurate timekeeping in the guest.
>
> The windows policy is obvious and results in a large improvement
> in accuracy. The Linux policy is more subtle, but is required to go
> from .1% to .03%.
>
> >> I think with my recent patch and hpet=1 (essentially the same as
> >> your emulated hpet), hvm guest time should track Xen system time.
> >> I wonder if domain0 (which if I understand correctly is directly
> >> using Xen system time) is also seeing an error of .1%? Also, for
> >> the skew you are seeing (in both hvm guests and domain0), is time
> >> moving too fast or too slow?
>
> I don't recall the direction. I can look it up in my notes at work
> tomorrow.
>
> >> Although hwhpet=1 is a fine alternative in many cases, it may
> >> be unavailable on some systems and may cause significant
> >> performance issues on others. So I think we will still need to
> >> track down the poor accuracy when hwhpet=0.
>
> Our patch is accurate to < .03% using the physical hpet mode or
> the simulated mode.
>
> >> And if for some reason
> >> Xen system time can't be made accurate enough (< 0.05%), then
> >> I think we should consider building Xen system time itself on
> >> top of hardware hpet instead of TSC... at least when Xen discovers
> >> a capable hpet.
>
> In our experience, Xen system time is accurate enough now.
>
> >> One more thought... do you know the accuracy of the TSC crystals
> >> on your test systems? I posted a patch awhile ago that was
> >> intended to test that, though I guess it was only testing skew
> >> of different TSCs on the same system, not TSCs against an
> >> external time source.
>
> I do not know the tsc accuracy.
>
> >> Or maybe there's a computation error somewhere in the hvm hpet
> >> scaling code? Hmmm...
>
> Regards,
> Dave
>
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@virtualiron.com]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@oracle.com; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors
> of .0012% on this platform under similar test conditions, and .03%
> on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits. In an overnight test with our hpet patch, the
> Windows vista error was .008%.
>
> The tests are with two or three guests on a physical node, all
> under load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config. Also, I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient
> > to missed ticks, so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next
> > week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dave
> > Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so
> > that we have some data to talk about.
> >
> > -Dave
> >
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@oracle.com
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather than host TSC? Dan has reported much better
> > time-keeping with his patch checked in, and it's for sure a lot
> > less invasive than this patchset.
> >
> > -- Keir
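Dan's 1.999*hpet_tick example in the quoted exchange above is easy to
check numerically. Below is a toy sketch of his reading of the
RHEL4-32 lost-tick correction (the fraction of a period discarded at
every interrupt); it is not the actual mark_offset_hpet() code, and
the values are made up for illustration:

    #include <stdio.h>

    int main(void)
    {
        double tick = 1.0;   /* one hpet_tick, normalized */
        double real = 0.0, credited = 0.0;
        int i;

        for (i = 0; i < 1000; i++) {
            double delta = 1.999 * tick;  /* interrupt arrives just shy of 2 ticks */
            real += delta;
            /* floor(delta / tick): the fractional 0.999 is discarded */
            credited += (double)(long)(delta / tick) * tick;
        }

        printf("real=%.0f ticks, credited=%.0f ticks, error=%.0f%%\n",
               real, credited, (credited - real) / real * 100.0);
        return 0;
    }

Each interrupt credits one tick while 1.999 ticks of real time pass,
so the program reports an error of about -50%, matching Dan's "off by
50%" estimate.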
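For anyone setting this up by hand, the two pieces of configuration
Dan gives at the top of the message above live in different files. A
sketch follows; only hpet=1, acpi=1, and the clock parameters come
from the thread, while the file paths, kernel title/version, root
device, and initrd line are illustrative assumptions:

    # hvm guest configuration file (e.g. /etc/xen/el5u1) -- other options omitted
    hpet=1
    acpi=1    # 1 is the default; hpet=1 is ignored if acpi=0

    # grub.conf inside the guest: append the clock parameters to the kernel line
    title RHEL5
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 clock=hpet notsc nopmtimer
        initrd /initrd-2.6.18-8.el5.img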
(Going back on list.)

OK, so looking at the updated patch, hpet_avoid=1 is actually
working, just reporting wrong, correct?

With el5u1-64-hvm and hpet_avoid=1 and timer_mode=4, skew is under
-0.04% and falling. With hpet_avoid=0, it looks about the same.

However, both cases seem to start creeping up again when I put load
on, then fall again when I remove the load -- even with sched-credit
capping cpu usage. Odd! This implies to me that the activity in the
other domains IS affecting skew on the domain-under-test. (Keir, any
comments on the hypothesis attached below?)

Another theoretical oddity... if you are always delivering timer
ticks "late", fewer than the nominal 1000 ticks/sec should be being
received. So then why is guest time actually going faster than an
external source? (In my mind, going faster is much worse than going
slower, because if ntpd or a human moves time backwards to compensate
for a clock going faster, "make" and other programs can get very
confused.)

Dan

> -----Original Message-----
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: Thursday, June 12, 2008 3:13 PM
> To: 'Dave Winchell'
> Subject: RE: xen hpet patch
>
> One more thought while waiting for compile and reboot:
>
> Am I right that all of the policies are correcting for when
> a domain "A" is out-of-context? There's nothing in any other
> domain "B" that can account for any timer loss/gain in domain
> "A". The only reason we are running other domains is to ensure
> that domain "A" is sometimes out-of-context, and the more
> it is out-of-context, the more likely we will observe
> a problem, correct?
>
> If this is true, it doesn't matter what workload is run
> in the non-A domains... as long as it is loading the
> CPU(s), thus ensuring that domain A is sometimes not
> scheduled on any CPU.
>
> And if all this is true, we may not need to run other
> domains at all... running "xm sched-credit -d A -c 50"
> should result in domain A being out-of-context at least
> half the time.
Dan,

You shouldn't be getting higher than .05%. I'd like to figure out
what is wrong. I'm running the same guest you are with heavy loads
and the physical processors overcommitted by 3:1. And I'm seeing
.027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then type 'Z' a couple of
times about 10 seconds apart and send me the output? Do this when
you have a domain running that is keeping poor time.

You should take drift measurements over a period of time that is at
least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from the xen
directory?

thanks,
Dave
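For readers following the numbers: the drift figures traded
throughout this thread (.027%, .05%, and so on) are the elapsed-time
error of the guest clock against a reference clock, expressed as a
percentage. A minimal sketch of the arithmetic, with sample values
chosen only to reproduce Dave's .027% over an hour (both elapsed
times are illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* elapsed seconds over the same ~1 hour interval */
        double ref_elapsed   = 3600.0;    /* reference (e.g. ntp) clock  */
        double guest_elapsed = 3600.972;  /* guest clock, illustrative   */

        double drift_pct = (guest_elapsed - ref_elapsed) / ref_elapsed * 100.0;
        printf("drift = %+.3f%%\n", drift_pct);  /* prints +0.027% */
        return 0;
    }

This is also why Dave asks for runs of at least 20 minutes: over
short intervals the absolute error is fractions of a second and
easily swamped by measurement noise.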
> > In EL5u1-32 however it looks like the fractions are accounted
> > for. Indeed the EL5u1-32 "lost tick handling" code resembles
> > the Linux/ia64 code, which is what I've always assumed was
> > the "missed tick" model. In this case, I think no policy
> > is necessary and the measured skew should be identical to
> > any physical hpet skew. I'll have to test this hypothesis though.
>
> I've tested this hypothesis and it seems to hold true.
> This means the existing (unpatched) hpet code works fine
> on EL5-32bit (vcpus=1) when hpet is the clocksource,
> even when the machine is overcommitted. A second hypothesis
> still needs to be tested: that Dave's patch will not make this worse.

OK, I can confirm that Dave's patch, as expected, does not make this
any worse.

The timer algorithm in 2.6.18 for x86 (i.e. RHEL5-32bit) is
definitely the most resilient to variations in tick delivery for a
monotonically-increasing timesource (i.e. hpet). This algorithm is in
arch-independent code, but sadly x86_64 didn't use it as of 2.6.18.

Dan
Hi Dan,

> Another theoretical oddity... if you are always delivering
> timer ticks "late", fewer than the nominal 1000 ticks/sec
> should be being received. So then why is guest time actually
> going faster than an external source?

With a guest that computes missed ticks, and does not deal with
fractional ticks when the interrupts are closer than a period: if you
send several interrupts farther apart than the period and then send
one closer than the period, the guest gains a tick. With this fact
you can have fewer than the expected number of interrupts and still
be gaining time.

With one that expects the right number of interrupts (Windows),
delivering fewer than expected makes the guest run slow.

-Dave
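To make the accounting concrete, here is a toy model of the effect
Dave describes: a guest that credits a late interrupt with its full
measured delta (missed ticks plus fraction) but still credits one
full tick for an early interrupt. This is a sketch under those
assumptions, not the actual RHEL or Xen code; the period value and
the delivery pattern are made up:

    #include <stdio.h>

    #define PERIOD 1000000ULL  /* hpet main-counter counts per tick (illustrative) */

    int main(void)
    {
        /* main-counter deltas between successive interrupts:
         * two delivered late (1.5 periods apart), one early (0.5) */
        unsigned long long deltas[] = { 3*PERIOD/2, 3*PERIOD/2, PERIOD/2 };
        unsigned long long real = 0, guest = 0;
        int i;

        for (i = 0; i < 3; i++) {
            real += deltas[i];
            if (deltas[i] >= PERIOD)
                guest += deltas[i]; /* late: missed ticks + fraction accounted */
            else
                guest += PERIOD;    /* early: a full tick is declared anyway */
        }

        /* real time advanced 3.5 periods, but only 3 interrupts were
         * delivered and the guest clock advanced 4.0 periods: fewer
         * than nominal ticks/sec, yet guest time runs fast. */
        printf("real=%llu guest=%llu gain=%llu\n", real, guest, guest - real);
        return 0;
    }

Dave's missed ticks policy removes the over-crediting branch from the
picture by never delivering two interrupts less than a period apart.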
Hi Dave --

I understand that ticks too close together cause time to move faster, but I thought your policy ensured that ticks were never delivered too close together. So I was surprised to see that time was moving faster rather than slower.

Dan
> I understand that ticks too close together cause
> time to move faster, but I thought your policy ensured
> that ticks were never delivered too close together.
> So I was surprised to see that time was moving faster
> rather than slower.

OK. Send me the debug info and I'll try to figure out what's going on.

thanks,
Dave
Hi Dave --

Hmmm... in my earlier runs with rhel5u1-64, I had apic=0 (yes apic, not acpi). Changing it to apic=1 gives excellent results (< 0.01% even with overcommit). Changing it back to apic=0 has the same fairly bad results, 0.08% with no overcommit and 0.16% (and climbing) with overcommit. Note that this is all with vcpus=1.

How odd...

I vaguely recalled from some research a couple of months ago that hpet is read MORE than once/tick on the boot processor. I can't seem to find the table I compiled from that research, but I did find this in an email I sent to you:

"You probably know this already but an n-way 2.6 Linux kernel reads hpet (n+1)*1000 times/second. Let's take five 2-way guests as an example; that comes to 15000 hpet reads/second...."

I wondered what was different between apic=1 vs 0. Using:

# cat /proc/interrupts | egrep 'LOC|timer'; sleep 10; \
  cat /proc/interrupts | egrep 'LOC|timer'

you can see that there are always 1000 LOC/sec. But with apic=1 there are also about 350 IO-APIC-edge-timer/sec, and with apic=0 there are 1000 XT-PIC-timer/sec.

I suspect that the latter of these (XT-PIC-timer) is messing up your policy and the former (edge-timer) is not.

Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@virtualiron.com]
Sent: Thursday, June 12, 2008 4:49 PM
To: dan.magenheimer@oracle.com; Keir Fraser; xen-devel
Cc: Ben Guthro; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan,

You shouldn't be getting higher than .05%. I'd like to figure out what is wrong. I'm running the same guest you are with heavy loads and the physical processors overcommitted by 3:1. And I'm seeing .027% error on rh5u1-64 after an hour.

Can you type ^a^a^a at the console and then type 'Z' a couple of times about 10 seconds apart and send me the output? Do this when you have a domain running that is keeping poor time.

You should take drift measurements over a period of time that is at least 20 minutes, preferably longer.

Also, can you send me a tarball of your sources from the xen directory?

thanks,
Dave
On 13/6/08 05:47, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I wondered what was different between apic=1 vs 0. Using:
>
> # cat /proc/interrupts | egrep 'LOC|timer'; sleep 10; \
>   cat /proc/interrupts | egrep 'LOC|timer'
>
> you can see that there are always 1000 LOC/sec. But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

I think apic=0 is not a particularly useful configuration though, right?

-- Keir
Hi Dan,

I'm glad you're able to reproduce my results. Are you still seeing the boot time hang up? Is this the reason for vcpus=1?

> you can see that there are always 1000 LOC/sec. But
> with apic=1 there are also about 350 IO-APIC-edge-timer/sec
> and with apic=0 there are 1000 XT-PIC-timer/sec.
>
> I suspect that the latter of these (XT-PIC-timer) is
> messing up your policy and the former (edge-timer) is not.

Thanks for this data. Your analysis is correct, I think. I wrote the interrupt routing and callback code for the IOAPIC edge triggered interrupts. The PIC path does not have the callbacks. With no callbacks, it always looks to the routing code in hpet.c as though it has been longer than a period since the last interrupt, because the end-of-interrupt time stamp is zero. Thus you get an interrupt on each timeout, i.e. 1000 interrupts/sec. 350 is a typical amount when the algorithm for missed ticks is doing its thing. I'll put this on the bug list -- unless no one cares about apic=0.

thanks,
Dave
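A sketch of the gate Dave describes makes the failure mode visible (the identifiers are illustrative, not the patch's actual names). Delivery is allowed only if at least one period has elapsed since the guest last finished handling a tick; with the PIC path the end-of-interrupt callback never fires, so the stamp stays zero and the gate always opens:

#include <stdint.h>
#include <stdbool.h>

struct vhpet_timer {
    uint64_t period;          /* main-counter counts per tick */
    uint64_t last_eoi_count;  /* stamped by the intr_en_notif callback */
};

static bool deliver_tick(const struct vhpet_timer *t, uint64_t counter_now)
{
    /* With no callbacks, last_eoi_count stays 0 and this is always
     * true, giving the full 1000 interrupts/sec Dan observed. */
    return counter_now - t->last_eoi_count >= t->period;
}

int main(void)
{
    struct vhpet_timer t = { .period = 14318, .last_eoi_count = 0 };
    return deliver_tick(&t, 100) ? 0 : 1;   /* delivers even at count 100 */
}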
Dan, Keir,

In an overnight (17.5 hrs) test with three guests, 8 vcpus each on 8 physical cpus, all under usex b48 loads, I noted the following errors:

    rh4u664       -.72 sec  (.0012%)
    rhas5u164    -10.2 sec  (.016%)
    sles10u164    -9.3 sec  (.015%)

The number for rh4u664 is what I am used to seeing on this platform. The other ones are 10 times worse, but still good enough for ntp. The reason they are worse is that the guest clock code for hpet in rhas5u164 looks at the cmp register to calculate interrupt delay. I mentioned before on this list that one of the beauties of hpet was the fine hpet code in the guest (rh4u664), which did not use the delay computation, which in my mind is unnecessary and adds error. Well, in rhas5u164, and I assume in sles10u164, delay is back in, and so is the associated error. The cmp register is also the reason for the hesitations on boot. I'll have more to say on this later.

thanks,
Dave
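For readers unfamiliar with the delay computation Dave is criticizing, here is a heavily hedged sketch of the idea (illustrative names only; not the rhas5u164 source). The guest estimates how late the interrupt is by comparing the main counter against the comparator (cmp) register and folds that delay into its accounting; if the virtual cmp does not track the injected interrupts exactly, the estimate is wrong and error accumulates, whereas counting whole periods from the main counter alone (as rh4u664 does) avoids this:

#include <stdio.h>
#include <stdint.h>

static uint64_t hpet_irq_delay(uint64_t main_counter, uint64_t cmp,
                               uint64_t period)
{
    uint64_t late = main_counter - cmp;  /* counts since cmp "fired" */
    return late % period;                /* fractional-period delay estimate */
}

int main(void)
{
    /* If the virtual cmp lags the injection point, the "delay" is
     * overestimated and the guest over-corrects its clock. */
    printf("delay = %llu counts\n",
           (unsigned long long)hpet_irq_delay(30000, 14318, 14318));
    return 0;
}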
> I think apic=0 is not a particularly useful configuration
> though, right?

We've seen it proposed sometimes as a workaround for a boot-time problem, but I agree it's not useful enough to warrant concern or stand in the way of Dave's patch.

Dan
Kudos, Dave, for your excellent work!

Keir, I've completed enough testing to agree that Dave's hpet policy is a huge improvement over the existing hpet code and a major improvement over the pit-based policies/timekeeping. I strongly recommend that, once Dave's soon-to-be-revised patch is in, we turn on hpet by default for all hvm guests. I'd also suggest that the default timer_mode (at least when hpet=1) should be Dave's guest_computes_missed_ticks policy. (Dave, could you include this in your revised patch? Or if you want me to, let me know.)

A couple of remaining points...

> I'm glad you're able to reproduce my results.
> Are you still seeing the boot time hang up?
> Is this the reason for vcpus=1?

No, I was just trying to be methodical in my testing, covering various cases. I haven't seen the boot-time hang for a while.

> I'll put this on the bug list - unless no one
> cares about apic=0.

It probably should be "on the bug list" but very low priority compared with getting the patch cleaned up (per Keir's requirements) in time for the 3.3 release.

Dan
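For reference, selecting the recommended policy from the toolstack would look roughly like the sketch below. It assumes the libxc xc_set_hvm_param() call of this era; HVM_PARAM_TIMER_MODE is the existing parameter, and HVM_HPET_guest_computes_missed_ticks is one of the two values the patch adds, so this compiles only against the patched headers:

#include <xenctrl.h>

static int set_hpet_missed_ticks_policy(int xc_handle, uint32_t domid)
{
    /* Choose the hpet policy for guests (e.g. Linux) that compute
     * missed ticks themselves from the main counter. */
    return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE,
                            HVM_HPET_guest_computes_missed_ticks);
}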
Dan Magenheimer wrote:

> Kudos, Dave, for your excellent work!

Thanks, Dan.

> Keir, I've completed enough testing to agree that
> Dave's hpet policy is a huge improvement over the
> existing hpet code and a major improvement over
> the pit-based policies/timekeeping. I strongly
> recommend that, once Dave's soon-to-be-revised
> patch is in, we turn on hpet by default for all
> hvm guests. I'd also suggest that the default
> timer_mode (at least when hpet=1) should be
> Dave's guest_computes_missed_ticks policy.
> (Dave, could you include this in your revised
> patch? Or if you want me to, let me know.)

Sure, I can do it.

> A couple of remaining points...
>
>> I'm glad you're able to reproduce my results.
>> Are you still seeing the boot time hang up?
>> Is this the reason for vcpus=1?
>
> No, I was just trying to be methodical in my testing,
> covering various cases. I haven't seen the boot-time
> hang for a while.

OK. We still see it here, so I'm working on a fix/workaround.

>> I'll put this on the bug list - unless no one
>> cares about apic=0.
>
> It probably should be "on the bug list" but very low
> priority compared with getting the patch cleaned up
> (per Keir's requirements) in time for the 3.3 release.

OK. Dan, thanks very much for the testing work. I know it's not easy and you still came up with the results very quickly.

-Dave