This patch fixes an HVM/VMX time resolution issue that occasionally causes IA32E guests to complain about lost ticks, and an APIC timer calibration issue.

It has not been tested on SVM; the common-code change is slight.

Eddie

Signed-off-by: Xiaowei Yang <xiaowei.yang@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>
On 17 Mar 2006, at 14:39, Dong, Eddie wrote:

> This patch fixes an HVM/VMX time resolution issue that occasionally
> causes IA32E guests to complain about lost ticks, and an APIC timer
> calibration issue.
>
> It has not been tested on SVM; the common-code change is slight.

This patch looks scary. Can you give more info about the problem and how you solve it? It looks like you end up forcibly syncing the guest's TSC rate to the PIT rate? Would that even be necessary if the PIT emulation were moved into Xen, where it ought to be?

On a slightly unrelated note, I think TSC rate management will start to get exciting when we have HVM save/restore. What will happen if a guest is restored on a machine with a quite different TSC rate to the machine it originally ran on? I was wondering whether the current TSC_OFFSET feature that VMX supports might be extended to allow control over the TSC clock rate as well. For example, provide 'base' and 'scale' values and apply the following when the guest executes RDTSC:

  guest_tsc = (host_tsc - base) * scale + offset

How do you guys see this working?

 -- Keir
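P.S. To make the base/scale idea concrete, here is a rough sketch of the arithmetic and nothing more. The struct and field names are invented for illustration (this is not an existing VMX or Xen interface), and 'scale' is assumed to be a 32.32 fixed-point guest/host frequency ratio.

#include <stdint.h>

/* Illustrative only: invented names, not a real Xen/VMX interface. */
struct hvm_tsc_ctl {
    uint64_t base;    /* host TSC value taken as the guest's epoch     */
    uint64_t scale;   /* guest_freq / host_freq, 32.32 fixed point     */
    uint64_t offset;  /* plays the role of the existing VMX TSC_OFFSET */
};

/* What the hypervisor would return for a guest RDTSC.
 * Uses GCC's unsigned __int128 for the 64x64->128 multiply. */
static uint64_t guest_rdtsc(const struct hvm_tsc_ctl *t, uint64_t host_tsc)
{
    unsigned __int128 delta = host_tsc - t->base;
    return (uint64_t)((delta * t->scale) >> 32) + t->offset;
}

A guest restored on a faster host would then get a fixed-point scale below 1.0, so the TSC frequency it calibrated on its original host stays apparently constant.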
Keir:
Before this patch we saw two issues.

The first occurs when a VM switch happens while an HVM guest is inside its PIT interrupt service routine (ISR). There is a problem if we let the guest see a jump in the TSC (the TSC is used to adjust the PIT). Previously the hypervisor tried to minimize the jump seen by the guest (TSC_OFFSET = 0 - pending_intr_nr * period), but the resolution is one PIT period, which is not enough. The situation is worse on IA32E. (A rough sketch of that old adjustment is at the end of this mail.)

The second issue is that when HVM/SMP is enabled, APIC timer calibration wants to see a TSC duration of about 100000000 cycles, so that the APIC timer frequency can be calibrated with IRQs disabled. This was unachievable with the previous code: at that point guest IRQs are disabled and no PIT IRQ is injected, so guest time is frozen. Because of that, the guest can never see 100000000 cycles pass (the TSC is frozen) and gets stuck there.

Another benefit is that we get a much more accurate guest calibration result; inaccurate calibration has been a known issue in the multiple-VM case for a long time.

I have a much more detailed description in the attached slides; I hope that helps.

BTW, with SMP support and more platform time sources to support (RTC and ACPI), we are planning some design changes to synchronize all those different kinds of time. This patch is mainly a fix for a bug that has existed for a long time and blocks the SMP effort.

thx,eddie

Keir Fraser wrote:
> This patch looks scary. Can you give more info about the problem and
> how you solve it? It looks like you end up forcibly syncing the
> guest's TSC rate to the PIT rate? Would that even be necessary if the
> PIT emulation were moved into Xen, where it ought to be?
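P.S. For anyone following along, the pre-patch adjustment mentioned above (TSC_OFFSET = 0 - pending_intr_nr * period) amounts to something like the sketch below. All of the names and the example period are made up for illustration; this is not the actual Xen code.

#include <stdint.h>
#include <stdio.h>

/* One PIT tick expressed in host TSC cycles: example value only,
 * roughly 10ms on a hypothetical 2.6GHz host. */
#define PIT_PERIOD_CYCLES 26000000ULL

struct missed_ticks {
    uint32_t pending_intr_nr;   /* PIT ticks missed while descheduled */
};

/* Stand-in for writing TSC_OFFSET into the VMCS. */
static void set_guest_tsc_offset(int64_t offset)
{
    printf("TSC_OFFSET <- %lld\n", (long long)offset);
}

/* Pull the guest's TSC back by one whole PIT period per undelivered
 * tick, so its PIT ISR sees roughly the TSC delta it expects.  The
 * resolution is a full PIT period, which is the limitation above. */
static void hide_missed_ticks(const struct missed_ticks *mt)
{
    set_guest_tsc_offset(0 - (int64_t)mt->pending_intr_nr
                           * (int64_t)PIT_PERIOD_CYCLES);
}

int main(void)
{
    struct missed_ticks mt = { .pending_intr_nr = 3 };
    hide_missed_ticks(&mt);   /* prints TSC_OFFSET <- -78000000 */
    return 0;
}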
On 17 Mar 2006, at 16:11, Dong, Eddie wrote:

> I have a much more detailed description in the attached slides; I
> hope that helps.

Well, freezing the TSC while a guest is descheduled is not very nice at all, but I can imagine it stops you getting "time went backwards" messages if you are also forcibly re-setting the TSC on PIT ticks. :-)

The freezing is I guess why you have the new hook schedule_out(), which I'm also not madly keen on. Especially since this must surely be a short-term workaround (you don't intend TSC freezing as a long-term solution, right?).

> BTW, with SMP support and more platform time sources to support (RTC
> and ACPI), we are planning some design changes to synchronize all
> those different kinds of time. This patch is mainly a fix for a bug
> that has existed for a long time and blocks the SMP effort.

Clearly some effort needs to be applied here. Moving all time emulation into Xen itself would be a good start (e.g., strip PIT emulation from qemu-dm). And the new support should be HVM generic, since there are no differences in time handling between VMX and SVM that should require (much) vendor-specific handling, I think. If there are, or if extra vendor support appears in future (e.g., I'd like to see guest TSC rate control, as I've said a few times before ;-) ), then we can add vendor hooks later.

In summary, I'm not sure about this patch. I feel that if I take it I'm encouraging 'onward and upward' development without spending the time to make sure fundamental abstractions like time are designed and implemented soundly.

 -- Keir
On 19 Mar 2006, at 14:28, Keir Fraser wrote:

>> I have a much more detailed description in the attached slides; I
>> hope that helps.
>
> Well, freezing the TSC while a guest is descheduled is not very nice
> at all, but I can imagine it stops you getting "time went backwards"
> messages if you are also forcibly re-setting the TSC on PIT ticks. :-)
>
> The freezing is I guess why you have the new hook schedule_out(),
> which I'm also not madly keen on. Especially since this must surely
> be a short-term workaround (you don't intend TSC freezing as a
> long-term solution, right?).

Actually, I now recall we were going to use this approach long term to ensure the guest calibrates its TSC rate correctly during boot, but then turn it off the first time the guest reads wall-clock time (via the RTC, for example). That means we will need the schedule_out() hook long term, which makes your patch less unattractive. I'll take another look and reconsider it.

 -- Keir
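P.S. To be clear, I am imagining nothing more elaborate than a one-shot flag, roughly as below (hypothetical names, not existing Xen code):

#include <stdbool.h>

/* Hypothetical sketch: names are invented, not real Xen structures. */
struct hvm_time_state {
    bool freeze_tsc_when_descheduled;  /* set at domain creation */
};

/* Called from the emulated CMOS/RTC read path: the guest has finished
 * its boot-time TSC calibration and is now reading wall-clock time, so
 * stop freezing its TSC across deschedules and let guest time track
 * real time again. */
static void hvm_rtc_read_notify(struct hvm_time_state *ts)
{
    ts->freeze_tsc_when_descheduled = false;
}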
Keir Fraser wrote:

> Well, freezing the TSC while a guest is descheduled is not very nice
> at all, but I can imagine it stops you getting "time went backwards"
> messages if you are also forcibly re-setting the TSC on PIT ticks. :-)

I think freezing the TSC is probably a very bad idea for guests. I really do want to know what's going on.

> In summary, I'm not sure about this patch. I feel that if I take it
> I'm encouraging 'onward and upward' development without spending the
> time to make sure fundamental abstractions like time are designed and
> implemented soundly.

I think you should heed your intuition ... it's usually quite solid!

thanks

ron
Keir:
Yes, for future support of multiple platform time sources (RTC, PIT, ACPI), I agree that eventually they should live in the hypervisor.

> The freezing is I guess why you have the new hook schedule_out(),
> which I'm also not madly keen on. Especially since this must surely
> be a short-term workaround (you don't intend TSC freezing as a
> long-term solution, right?).

It is true that I don't like freezing guest time at deschedule time, but for now we have to stay with it until we find a better solution :-( The reason lies in the legacy guest PIT ISR (interrupt service routine). The ISR reads the TSC and computes the elapsed TSC cycles since the last PIT IRQ fired (by comparing against the saved old TSC value). If Xen doesn't present exactly the expected TSC difference, the guest accumulates the discrepancy in its PIT ISR and applies fixups that make a mess of guest jiffies, and it complains about lost ticks. Eventually that forces the guest to give up using the TSC as a time source (and fall back to the pure PIT). With this patch we get very accurate guest time in our local tests :-)

Keir Fraser wrote:
> Actually, I now recall we were going to use this approach long term
> to ensure the guest calibrates its TSC rate correctly during boot,
> but then turn it off the first time the guest reads wall-clock time
> (via the RTC, for example). That means we will need the
> schedule_out() hook long term, which makes your patch less
> unattractive. I'll take another look and reconsider it.

Yes, this matches what Ian and Asit discussed at the Xen summit too. It can solve the TSC calibration issue, since the wall-clock (RTC) read happens some time after TSC calibration. But it has a problem on the APIC timer calibration side: that calibration is done very late in Linux (not sure about other OSes), even later than init thread creation, which is hard to detect from Xen.

Freezing the TSC serves a similar purpose to this suggestion. The difference with the freezing approach is that we have to assume guest calibration is a one-time task; otherwise the guest may see time go backwards at runtime. A better solution, which removes this assumption, is to implement a mechanism like the PIT IRQ output line that discards accumulated IRQs while guest IRQs are disabled. That is, if guest IRQs are disabled, pickup_deactive_ticks should ignore the elapsed ticks (and add at most one more pending IRQ). That way the guest behaviour is exactly the same as on native hardware. We should put this on our TODO list :-) (A rough sketch of what I mean is at the end of this mail.)

> In summary, I'm not sure about this patch. I feel that if I take it
> I'm encouraging 'onward and upward' development without spending the
> time to make sure fundamental abstractions like time are designed and
> implemented soundly.

Thanks! We plan to send out a more comprehensive time virtualization design soon, to better support multiple platform time sources and SMP. We have seen several issues with forwarding guest time under SMP. We will send the design out as soon as possible and collect feedback from you and everyone else :-)

thx,eddie
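P.S. Roughly what I mean by the PIT-output-line behaviour, as a sketch only: the structure, the fields and the signature of pickup_deactive_ticks here are simplified and invented, not the real Xen definitions.

#include <stdbool.h>
#include <stdint.h>

/* Invented, simplified state for the emulated PIT channel. */
struct pit_channel {
    uint64_t period_cycles;    /* one PIT tick, in guest TSC cycles    */
    uint64_t last_fire_tsc;    /* guest TSC when the last tick fired   */
    uint32_t pending_intr_nr;  /* ticks waiting to be injected         */
};

static void pickup_deactive_ticks(struct pit_channel *pit,
                                  uint64_t guest_tsc, bool irqs_enabled)
{
    uint64_t missed = (guest_tsc - pit->last_fire_tsc) / pit->period_cycles;

    if (irqs_enabled) {
        /* Deliver every missed tick so the guest's jiffies catch up. */
        pit->pending_intr_nr += (uint32_t)missed;
    } else if (missed && pit->pending_intr_nr == 0) {
        /* Guest IRQs were off: a real PIT would just leave its output
         * line asserted, so queue at most one tick and drop the rest. */
        pit->pending_intr_nr = 1;
    }
    /* Account for the elapsed periods either way, so the ignored ticks
     * really are discarded rather than re-counted later. */
    pit->last_fire_tsc += missed * pit->period_cycles;
}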
On 20 Mar 2006, at 16:08, Dong, Eddie wrote:

> Yes, this matches what Ian and Asit discussed at the Xen summit too.
> It can solve the TSC calibration issue, since the wall-clock (RTC)
> read happens some time after TSC calibration. But it has a problem on
> the APIC timer calibration side: that calibration is done very late
> in Linux (not sure about other OSes), even later than init thread
> creation, which is hard to detect from Xen.

Hmmm... in fact it looks like Linux reads the CMOS RTC before even calibrating bogomips, so that wouldn't be a good point to disable TSC freezing after all. Another issue is that some calibration loops read the PIT counter (and would be confused by wrapping), or expect to receive timely interrupts to increment jiffies. Those are hard to guarantee in a virtualised environment. So there's a general timeliness issue as well as the original 'delay loop progress' versus 'time progress' issue.

There's no good way out of this, I suspect. If guest time is to track wallclock time then guests are going to have to see time jumping forward across preemptions, or the jumping is simply going to be saved up for some time later (e.g., as you do currently when the PIT underflows). Maybe we should do something really simple, like run the guest in 'virtual' (scheduled) time for some number of seconds after boot, then switch to real time (which runs at an accelerated rate for a short while to catch back up with real time)? (A rough sketch of what I mean is at the end of this mail.)

> A better solution, which removes this assumption, is to implement a
> mechanism like the PIT IRQ output line that discards accumulated IRQs
> while guest IRQs are disabled. That is, if guest IRQs are disabled,
> pickup_deactive_ticks should ignore the elapsed ticks (and add at
> most one more pending IRQ). That way the guest behaviour is exactly
> the same as on native hardware. We should put this on our TODO list
> :-)

What effect will this have? Are you suggesting we always run guest time at 'virtual time' rather than real wallclock time?

 -- Keir
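P.S. By 'accelerated rate' I mean nothing cleverer than the sketch below. The names and the 10% catch-up rate are arbitrary choices, just to show the shape of the idea.

#include <stdint.h>

#define CATCHUP_NUM 11   /* guest time runs at 110% of real time ... */
#define CATCHUP_DEN 10   /* ... until the accumulated lag is repaid  */

struct guest_clock {
    uint64_t lag_ns;     /* real time minus guest time accrued so far */
};

/* How much guest time to advance for 'delta_ns' of real time. */
static uint64_t guest_time_step(struct guest_clock *gc, uint64_t delta_ns)
{
    if (gc->lag_ns == 0)
        return delta_ns;                    /* in sync: track real time */

    uint64_t step  = delta_ns * CATCHUP_NUM / CATCHUP_DEN;
    uint64_t extra = step - delta_ns;       /* portion repaying the lag */

    if (extra >= gc->lag_ns) {              /* lag fully repaid         */
        step = delta_ns + gc->lag_ns;
        gc->lag_ns = 0;
    } else {
        gc->lag_ns -= extra;
    }
    return step;
}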
Keir:
Thanks!

Keir Fraser wrote:
> What effect will this have? Are you suggesting we always run guest
> time at 'virtual time' rather than real wallclock time?

Ooo, the new proposal is not focused on that issue :-) The basic issues we saw are:

1: How to jump guest time.
For example, say an SMP guest has 2 VPs whose APIC timers (VP0 and VP1) are scheduled to fire at, say, the 4ms and 6ms points, while the platform timer (say the PIT) is scheduled for 8ms. When VP0 is descheduled and VP1 is switched in, we probably cannot inject the APIC timer IRQ into VP1 even once the hypervisor has passed the 6ms point, because injecting that IRQ means VP1 sees guest time jump to 6ms+, and the same on the TSC (platform time). Otherwise, when VP0 is switched back in, the guest TSC on VP0 is already at 6ms+, but its APIC timer ISR still assumes it is at the 4ms point. Losing synchronization like this means VP0 sees time running backwards relative to its local timer, which can cause various corner cases like the ones we previously saw with the PIT and TSC. Once per-processor timer IRQs are combined with the platform timer IRQ, the situation becomes much more complicated.

2: How to deliver guest timer IRQs efficiently.
In the same situation as above, if the VP that owns the next scheduled timer is descheduled, all the other VPs may be unable to get timer IRQs. That is unfair, and in some difficult cases there may be no way to catch up :-) Pinning the platform timer IRQ to a particular VP is even worse :-(

3: Make the platform timer code object-oriented.
That means that, whether the guest uses the RTC, ACPI timer or PIT, each HVM domain's configuration can choose any of them, and Xen will provide dynamic registration APIs. That way we are no longer tied to the PIT. (A rough sketch of the kind of interface we have in mind is at the end of this mail.)

We have something in mind, but it is not fully complete yet. For simplicity, we may assume:
a) A guest OS only uses one of the platform timers as its ticking source.
b) The platform timer IRQ is not pinned to a particular VP.

thx,eddie
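P.S. For point 3, the kind of interface we have in mind is roughly the ops table below. All of the names are invented just to show the shape; none of this is existing Xen code.

#include <stddef.h>
#include <stdint.h>

struct hvm_domain;

struct platform_timer_ops {
    const char *name;                               /* "pit", "rtc", ... */
    void (*freeze)(struct hvm_domain *d);           /* domain descheduled */
    void (*thaw)(struct hvm_domain *d);             /* domain rescheduled */
    uint32_t (*read_counter)(struct hvm_domain *d); /* emulated reads     */
};

struct hvm_domain {
    const struct platform_timer_ops *tick_source;   /* chosen by config   */
};

/* The guest's configuration picks exactly one timer as its ticking
 * source; the other emulated timers still exist but do not drive the
 * guest's jiffies accounting. */
static int hvm_register_tick_source(struct hvm_domain *d,
                                    const struct platform_timer_ops *ops)
{
    if (d->tick_source != NULL)
        return -1;          /* assumption (a): only one tick source */
    d->tick_source = ops;
    return 0;
}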