Hi all,

I have been using Xen for some time now and I am very happy with it; thanks for all the good work.

The only problem I have had with Xen on our server is that sometimes (usually a few times per day) the time in dom0 went berserk and started running about three times faster. The only fix was to reboot the machine. I remember seeing similar problems reported on this list a long time ago, although I can't locate them at the moment.

In my case, I traced the problem down to a buggy chipset. The VIA686a PIT timer randomly loses its programming and needs to be reset. The Linux kernel has a workaround for this, but it does not get used when Xen comes into play, as the hypervisor takes over control of the PIT.

I have implemented a similar workaround in the Xen hypervisor. I have been running it for about three weeks now and the server is perfectly stable.

I am interested in your comments, and I would be happy if you could apply this patch to the Xen sources.

Thanks
Tomas Kopal

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

On 14 Mar 2006, at 18:05, Tomas Kopal wrote:

> Well, in my case, I traced the problem down to a buggy chipset. The VIA686a PIT timer randomly loses its programming and needs to be reset. The linux kernel has a workaround for this, but this does not get used when xen comes to play as the hypervisor takes over control of the PIT.
> I have implemented similar workaround in xen hypervisor. So far I am running it for about three weeks now and the server is perfectly stable.
>
> I am interested in your comments, and I would be happy if you could apply this patch to xen sources.

Do you have any details on what mode the timer enters when it loses its programming, whether this affects all PIT channels, etc?

The patch is potentially okay -- it differs from Linux in that we free-run channel 2 (we don't periodically and automatically re-latch) and so the Linux test for count > latch does not work. The test you use (diff > 2*latch) is kind of weird, even if it does seem to work for you: I wonder what kind of mode it enters where readings make it look like it is running at three times normal speed?

Also, although you detect and fix up channel 2 problems, all that code is driven off the channel 0 timer interrupt handler. What happens if ch0 loses its programming? :-)

Really I want to understand this problem rather better before committing a patch for a six-year-old chipset.

 -- Keir

On 15.3.2006 13:32, Keir Fraser wrote:

> On 14 Mar 2006, at 18:05, Tomas Kopal wrote:
>
>> Well, in my case, I traced the problem down to a buggy chipset. The VIA686a PIT timer randomly loses its programming and needs to be reset. The linux kernel has a workaround for this, but this does not get used when xen comes to play as the hypervisor takes over control of the PIT.
>> I have implemented similar workaround in xen hypervisor. So far I am running it for about three weeks now and the server is perfectly stable.
>>
>> I am interested in your comments, and I would be happy if you could apply this patch to xen sources.
>
> Do you have any details on what mode the timer enters when it loses its programming, whether this affects all PIT channels, etc?

Well, there is not much info on this. There is no official information from VIA, only speculation. Most of what I found is on LKML. The best summaries I found are here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.0/1613.html
and
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.3/1068.html

One of the initial problem descriptions:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.2/1405.html

AFAIK it affects only one channel at a time, but not always the same one (the Linux kernel uses channel 0, Xen uses channel 2, and the problem is the same for both). It probably does not affect all channels together, as a bug on channel 1 could be quite disastrous to the memory contents.

But similar problems may exist in other chipsets too:
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q274323
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q266344

So having somewhat more "robust" PIT handling should generally help.

> The patch is potentially okay -- it differs from Linux in that we free-run channel 2 (we don't periodically and automatically re-latch) and so the Linux test for count > latch does not work.
> The test you use (diff > 2*latch) is kind of weird, even if it does seem to work for you: I wonder what kind of mode it enters where readings make it look like it is running at three times normal speed?

I think that the mode is not changed, just the immediate value in the timer. My explanation is that the timer sometimes (probably when the system is under heavy load, like during domU shutdown) makes a "random jump", probably by resetting the current timer value to some other, random one, and then continues counting. If this happens during a calibration call, the calibrated values are completely off, and the system time starts to run away because invalid calibration data is being used. Together with xntpd it can get even messier. (Just for the record, I tried turning xntpd off in dom0, but the problem remained.) But this is not backed up by any real evidence, so take it with heaps of salt :-).

The test for diff > 2*latch is a bit of a heuristic :-). You are right that this differs from Linux: Xen does not reset the counter to latch but free-runs it. But the diff between subsequent values should always be near the latch value, as the reads are driven by channel 0, which is set to interrupt once per latch period. I was printing out real diff values (detecting min and max over periods of time) and they varied about 40% around the latch value. I didn't want to get too many false positives, so I set the threshold to double the expected value. As the problematic values tend to be quite high, I think this is a safe threshold.

> Also, although you detect and fix up channel 2 problems, all that code is driven off the channel 0 timer interrupt handler. What happens if ch0 loses its programming? :-)

Don't know. Either it does not lose it, or the effect of losing it is not that obvious. Do you know of any easy way to detect this (i.e. to detect missing or late interrupts)? We can't use channel 2, as we can't trust it. Maybe we can use the TSC?
As I said, I expect the timer to continue counting, so if I am right, the only problem this can cause is that the timer interrupt will come a bit later. Apart from timekeeping, this should not be a big deal, or is it?

Thinking about this now, the cause may even be that the counter problem is in channel 0 only. Then the timer interrupt would come a lot later and the difference in the values of channel 2 could overflow to negative values?

> Really I want to understand this problem rather better before committing a patch for a six-year-old chipset.
>
> -- Keir

Yes, the chipset is quite old. We were already thinking about replacing it, but after this fix, it will probably have to serve a bit longer :-).

I share your desire to understand the problem, but I still don't understand it, and it seems that the people on LKML didn't completely understand it either. And according to the MSDN records, it may be quite widespread, even on newer chipsets...

Feel free to make it a compile-time option, or just move it to contrib. But if it can save anyone the trouble I had to go through, it would definitely be beneficial to have it in the mainstream, especially as it does not add any penalty on fault-free systems.

Thanks a lot
Tomas
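
The check discussed in the message above (compare the channel 2 difference between two consecutive ticks against twice the latch value) can be written down in a few lines. This is only a sketch of the idea, not the patch itself. It assumes a 16-bit counter that counts down and wraps freely, a PIT input clock of 1193180 Hz, and a 100 Hz tick; the constants and the names `pit_elapsed` and `pit_reading_suspect` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, not taken from the patch under discussion. */
#define CLOCK_TICK_RATE 1193180u                      /* PIT input clock, Hz */
#define HZ              100u                          /* timer ticks per second */
#define LATCH           ((CLOCK_TICK_RATE + HZ / 2) / HZ)

/*
 * Counts elapsed between two reads of a free-running 16-bit down counter.
 * Unsigned 16-bit arithmetic handles the wrap from 0 back to 0xFFFF.
 */
static uint16_t pit_elapsed(uint16_t prev, uint16_t now)
{
    return (uint16_t)(prev - now);
}

/*
 * Heuristic from the thread: a difference of more than twice the latch
 * value between consecutive ticks is treated as a hardware glitch.
 */
static bool pit_reading_suspect(uint16_t prev, uint16_t now, uint16_t latch)
{
    /* The check can only fire if the threshold fits in the 16-bit range. */
    assert(latch > 0 && 2u * latch < UINT16_MAX);
    return pit_elapsed(prev, now) > 2u * latch;
}
```

With these assumed constants the latch is 11932 counts and the threshold is 23864 counts, so a normal tick passes and a difference of 40000 counts is flagged. Note that a 16-bit difference can never look negative; a counter that jumps "backwards" simply shows up as a very large elapsed value, which this test also catches.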

On 15 Mar 2006, at 19:52, Tomas Kopal wrote:

> I was printing out real diff values (detecting min and max over periods of time) and it varied about 40% around the latch value. I didn't want to get too many false positives, so I set it to double the expected value. As the problematic values tend to be quite high, I think this is a safe threshold.

A 40% range is huge, given that Xen disables interrupts only for very short periods of time.

If you think that the timer ends up corrupting its count value, but continues counting in the mode we originally programmed it to, there would be no need for your patch to reprogram the timer. We could just clamp diff and let the timer continue to free-run from whatever value it corrupted itself to. Would that simpler patch, with no reprogramming, work for you?

I agree with your suspicion that this may be a channel-0 problem. The 40% value range points at some serious weirdness.

As for other timer problems -- the really common one (latched reads do not latch, so you get inconsistent 16-bit reads) doesn't affect our long-term time stability. We end up out by a few hundred microseconds on that read of the clock, but the 16-bit timer value doesn't wrap or anything really bad like that, so we can recover. Phew. :-)

 -- Keir
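
The simpler alternative suggested above, clamping the difference and leaving the counter to free-run, might look roughly like the following. The thread does not say what value to clamp to, so the caller-supplied `max_ticks` is an assumption, as are the `pit_clock` structure and the function name; this is a sketch, not Xen's actual time code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator: total counts seen plus the last raw reading. */
struct pit_clock {
    uint64_t total;   /* accumulated PIT counts */
    uint16_t stamp;   /* previous raw reading of the down counter */
};

/*
 * Fold a new raw reading into the accumulator. An implausibly large
 * difference is clamped to max_ticks instead of being trusted, and the
 * counter itself is not reprogrammed: the next difference is measured
 * from whatever value the hardware is at now.
 */
static void pit_clock_update(struct pit_clock *c, uint16_t now,
                             uint16_t max_ticks)
{
    assert(c != NULL);

    uint16_t diff = (uint16_t)(c->stamp - now);  /* down counter, wraps */
    if (diff > max_ticks)
        diff = max_ticks;

    c->total += diff;
    c->stamp = now;
}
```

This only recovers cleanly if the glitch is a one-off jump in the count. If the counter has actually changed mode, later differences would keep coming out wrong and would keep being clamped.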

OK, I gathered a bit more data.

On 15.3.2006 23:44, Keir Fraser wrote:

> On 15 Mar 2006, at 19:52, Tomas Kopal wrote:
>
>> I was printing out real diff values (detecting min and max over periods of time) and it varied about 40% around the latch value. I didn't want to get too many false positives, so I set it to double the expected value. As the problematic values tend to be quite high, I think this is a safe threshold.
>
> 40% range is huge, given that Xen disables interrupts only for very short periods of time.

Sorry, I overshot. It's up to 30%. Here is part of my debug log, using the TSC. Each line shows the minimum and maximum values gathered over approximately 15 minutes. That is still a lot, but as these are absolute extremes over a long period of time, it may not be as bad most of the time.

(XEN) Stats: min_tsc = 7272785, max_tsc = 9728578
(XEN) Stats: min_tsc = 6315016, max_tsc = 10686693
(XEN) Stats: min_tsc = 7287942, max_tsc = 9717338
(XEN) Stats: min_tsc = 7349398, max_tsc = 9653272
(XEN) Stats: min_tsc = 7101256, max_tsc = 9898106
(XEN) Stats: min_tsc = 6246158, max_tsc = 10753919
(XEN) Stats: min_tsc = 6263384, max_tsc = 10999952
(XEN) Stats: min_tsc = 6207822, max_tsc = 10799607
(XEN) Stats: min_tsc = 6919892, max_tsc = 10073639
(XEN) Stats: min_tsc = 6137085, max_tsc = 10864224
(XEN) Stats: min_tsc = 6276877, max_tsc = 10724951
(XEN) Stats: min_tsc = 7151101, max_tsc = 9848466
(XEN) Stats: min_tsc = 7020142, max_tsc = 9978974
(XEN) Stats: min_tsc = 7002859, max_tsc = 9992022

> If you think that the timer ends up corrupting its count value, but continues counting in the mode we originally programmed it to, there would be no need for your patch to reprogram the timer. We could just clamp diff and let the timer continue to free-run from whatever value it corrupted itself to.
> Would that simpler patch, with no reprogramming, work for you?

I tried not resetting the timer, and once the error appeared, it didn't go away for quite a long time (it did disappear in the end, though). It definitely was not a one-time problem. That leads me to believe that the mode IS changed after all, and that it switches to a mode with a pre-programmed reset value, so the counter does not overflow as expected when free-running and the result of the subtraction is wrong.

> I agree with your suspicion that this may be a channel-0 problem. The 40% value range points at some serious weirdness.

As you can see from the following snippet, if we can trust the TSC, channel 0 is not affected when the error occurs; the interrupts still arrive regularly.

(XEN) Stats: min_tsc = 6125635, max_tsc = 11387514
(XEN) Stats: min_tsc = 5518772, max_tsc = 11457692
(XEN) Stats: min_tsc = 6331729, max_tsc = 10671097
(XEN) PIT Timer HW error: 40750
(XEN) Stats: min_tsc = 6137681, max_tsc = 10868227
(XEN) Stats: min_tsc = 6372589, max_tsc = 10626930
(XEN) Stats: min_tsc = 6096761, max_tsc = 10902500
(XEN) Stats: min_tsc = 6218072, max_tsc = 10784801
(XEN) Stats: min_tsc = 6232108, max_tsc = 10775067
(XEN) Stats: min_tsc = 6088005, max_tsc = 10914563
(XEN) PIT Timer HW error: 39470
(XEN) Stats: min_tsc = 5037753, max_tsc = 11961922
(XEN) Stats: min_tsc = 6189690, max_tsc = 10810470
(XEN) Stats: min_tsc = 6247146, max_tsc = 10754276

Thanks
Tomas
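
For anyone who wants to turn the min/max pairs in these logs into a percentage, one possible way is to express the distance from the midpoint of each pair relative to that midpoint. Treating the midpoint as the nominal tick interval is an assumption, not something stated in the thread, and the function names below are hypothetical. The parser only accepts lines in the "Stats" format shown in the logs.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse one "(XEN) Stats: min_tsc = A, max_tsc = B" line.
 * Returns false for any other line (e.g. the "PIT Timer HW error" lines).
 */
static bool parse_stats_line(const char *line,
                             uint64_t *min_tsc, uint64_t *max_tsc)
{
    return sscanf(line,
                  "(XEN) Stats: min_tsc = %" SCNu64 ", max_tsc = %" SCNu64,
                  min_tsc, max_tsc) == 2;
}

/*
 * Deviation of the extremes from the midpoint of the pair, as a percentage
 * of that midpoint. The midpoint stands in for the nominal interval, which
 * is an assumption.
 */
static double tsc_half_spread_percent(uint64_t min_tsc, uint64_t max_tsc)
{
    assert(min_tsc <= max_tsc);

    double mid = ((double)min_tsc + (double)max_tsc) / 2.0;
    if (mid == 0.0)
        return 0.0;
    return 100.0 * ((double)max_tsc - mid) / mid;
}
```

For example, the pair min_tsc = 6137085, max_tsc = 10864224 from the first log works out to a spread of roughly 27.8% around its midpoint.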