Thomas Gleixner
2007-Apr-18 13:02 UTC
+ stupid-hack-to-make-mainline-build.patch added to -mm tree
On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
> > a proper CE device also has the added bonus of making high-res timers
> > guests work automatically. It should be simple: just pass it through to
> > your hypervisor, a hyper-CE-device, like a hyper-clocksource device has
> > essentially no guest-side complexity.
> >
>
> It is not so simple. In theory it works great. In reality, the i386
> implementation is completely hardwired to work the way hardware works,
> and breaking the clockevent code out of the deep ties to the APIC is
> extremely non-trivial. We tried, and could not accomplish it for 2.6.21
> because the hrtimers integration was complex, and introduced many bugs
> for us.

Why is this so non-trivial? All you have to do is _NOT_ register
PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
which uses the hypervisor timer emulation instead of real hardware.

clockevents breaks the hardwired assumptions of the old timer code and
allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
stuff like

	/* Disable PIT. */
	outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */

> We worked around this by keeping NO_IDLE_HZ support, which now
> you deprecated. So now we are using NO_HZ without a hyper-CE device,
> and it is working fine. We understand the benefits of moving to the CE
> model - but it cannot be done overnight.

This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
irq_enter() and irq_exit() so the clockevents code is actually invoked.
I have not looked closely enough at why this works at all. I have the
feeling that "working fine" means something like "does not explode".

We really want to fix this now instead of pushing some
not-know-why-it-works hack into the kernel.

	tglx
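
[Editor's illustration, not part of the original message.] The idea of a
per CPU hyper-CE-device can be modelled in a few lines of user-space C.
This is only a sketch under stated assumptions: `struct ce_device`,
`ce_register()`, `hv_set_timer()` and `hv_advance()` are hypothetical
names invented here, loosely inspired by the clock event device concept.
They are not the kernel's clockevents API and not any hypervisor's real
interface. The point it tries to show is that the guest-side device only
forwards "fire in N nanoseconds" to the hypervisor and touches no
PIT/HPET/APIC state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 4

/* Hypothetical, simplified stand-in for a clock event device. */
struct ce_device {
	const char *name;
	int cpu;
	/* Program the next one-shot event, relative to "now". */
	int (*set_next_event)(struct ce_device *dev, uint64_t delta_ns);
	/* Called when the programmed event fires. */
	void (*event_handler)(struct ce_device *dev);
	uint64_t events_handled;
};

/* One registered device per CPU (hypothetical registry). */
static struct ce_device *percpu_ce[MAX_CPUS];

/* Mock hypervisor state: current time and one one-shot timer per CPU. */
static uint64_t hv_now_ns;
static uint64_t hv_deadline_ns[MAX_CPUS];
static int hv_armed[MAX_CPUS];

/* Hypothetical "hypercall": arm the one-shot timer of a virtual CPU. */
static int hv_set_timer(int cpu, uint64_t deadline_ns)
{
	if (cpu < 0 || cpu >= MAX_CPUS)
		return -1;
	hv_deadline_ns[cpu] = deadline_ns;
	hv_armed[cpu] = 1;
	return 0;
}

/* Guest side of the hyper-CE-device: just pass the request through. */
static int hyper_ce_set_next_event(struct ce_device *dev, uint64_t delta_ns)
{
	return hv_set_timer(dev->cpu, hv_now_ns + delta_ns);
}

/* Trivial event handler used for illustration: count delivered events. */
static void count_event(struct ce_device *dev)
{
	dev->events_handled++;
}

/* Register a device for its CPU; fails if the slot is already taken. */
static int ce_register(struct ce_device *dev)
{
	assert(dev != NULL && dev->set_next_event != NULL);
	if (dev->cpu < 0 || dev->cpu >= MAX_CPUS || percpu_ce[dev->cpu] != NULL)
		return -1;
	percpu_ce[dev->cpu] = dev;
	return 0;
}

/* Mock hypervisor: advance time and deliver every expired one-shot event. */
static void hv_advance(uint64_t ns)
{
	hv_now_ns += ns;
	for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
		struct ce_device *dev = percpu_ce[cpu];

		if (!hv_armed[cpu] || hv_deadline_ns[cpu] > hv_now_ns)
			continue;
		hv_armed[cpu] = 0; /* one-shot: disarm before delivery */
		if (dev != NULL && dev->event_handler != NULL)
			dev->event_handler(dev);
	}
}
```

In this model a guest would fill in one `struct ce_device` per CPU with
`hyper_ce_set_next_event` and a handler such as `count_event`, register
it with `ce_register()`, and program events through `set_next_event`;
`hv_advance()` plays the role of the hypervisor delivering the virtual
timer interrupt. How the interrupt is actually delivered on i386 is the
contested part of this thread and is deliberately left out of the sketch.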
Zachary Amsden
2007-Apr-18 13:02 UTC
+ stupid-hack-to-make-mainline-build.patch added to -mm tree
Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>> no, that's not the case: next_timer_interrupt() is the NO_IDLE_HZ
>> method of doing things - while in the NO_HZ case you are supposed to
>> use clockevent devices to program timer hardware.
>>

We don't have a clockevent device. But we need NO_IDLE_HZ support,
which NO_HZ has now subsumed.

> a proper CE device also has the added bonus of making high-res timers
> guests work automatically. It should be simple: just pass it through to
> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has
> essentially no guest-side complexity.
>

It is not so simple. In theory it works great. In reality, the i386
implementation is completely hardwired to work the way hardware works,
and breaking the clockevent code out of the deep ties to the APIC is
extremely non-trivial. We tried, and could not accomplish it for 2.6.21
because the hrtimers integration was complex, and introduced many bugs
for us. We worked around this by keeping NO_IDLE_HZ support, which now
you deprecated. So now we are using NO_HZ without a hyper-CE device,
and it is working fine. We understand the benefits of moving to the CE
model - but it cannot be done overnight.

Xen has the same requirements for integrating their timer code.

Zach
Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>> The correct solution here is to properly separate the APIC, SMP, and
>> timer code so the logic of it which we want to reuse is separated from
>> the hardware dependence. Clock events and clocksources take care of
>> most of the timer issues, but there is still ugliness from SMP timer
>> events depending on having part of the APIC infrastructure for wiring
>> the interrupt gates.
>>
>
> what are you talking about? A clockevents driver does not need to know
> about lapic details, at all. In terms of interrupt gates for the
> hypervisor to notify about clock events - use a virtual interrupt
> controller via genirq.
>

See my last e-mail. It is not possible on i386, since local per-cpu
interrupts are only supported via the APIC.

> if you want to use hardwired hardware details as your API: DO IT WITHOUT
> MODIFYING LINUX. If you want anything more intelligent, something more
> 'paravirtual' - WORK WITH US AND WORK WITH THE OTHER HYPERVISORS. So far
> all i've seen from you was excuses and stonewalling on every step! We
>

So far, all you have done is not complain about our code until it was
merged, then pursue every tactic possible to break it. It is not us
that are stonewalling.

> told you about the need to do VMI-timer ontop of clockevents last year
> already! You resisted virtually EVERY SINGLE cleanup suggestion since
> your stuff got upstream and you ONLY acted when a change was force-fed
> to you. Just count the number of emails you wrote, versus the patches
> you did. And your code is barely 2 weeks in! That is unacceptable.

Which cleanups have we resisted in particular? I can't recall any.
Just count the number of emails you wrote versus the patches and
helpful suggestions you made. No, instead, you broke our code, in many
ways, with the untouchable aim of cleaning up the kernel source to do
things the way you think they should be done in a future release.

Our code is in the tree now, and any attempts to break it using such
justifications as easing maintenance for kernel developers in future
releases are flat out false and improper. We are working to correct
flaws that we have and properly conform to the changing interfaces such
as the timer subsystem, and also to interoperate properly with the full
set of available configurations.

In the meantime, having code that uses slightly older interfaces in the
kernel tree is not wrong in any way - it is pragmatic, because that
code is working today, and not only that, the sanest thing to do in a
release cycle. And our code in the tree to be released imposes zero
burden on anyone except for us. Are we stopping you from rewriting the
timer subsystem in the -rc tree? How? Because this code is supposed to
be settled.

Your deliberate breaking of our code forces us to come up with
workarounds that might be considered inappropriate, but nevertheless,
necessary. Who has to deal with and adapt to this? Certainly not you.
The burden to maintain the correctness of our code is on us.

Working together to make sure that this code completely integrates with
all this new development is the right thing to do - in the development
tree. Why you insist on stopping our code in the tip kernel release
tree is beyond me, as there is no purpose to it other than to block our
code.

Zach