George Dunlap
2010-Jul-02 13:40 UTC
[Xen-devel] Multi-vcpu HVM Linux domain hanging during boot
I've got an HVM Linux guest, Debian 2.6.18-6-686 kernel, which works fine if vcpus=1 but hangs if vcpus=2.

I'm pretty sure that it worked with vcpus=2 earlier this week, but now I seem unable to find a hypervisor/tools/qemu combination within the last month that works.

It hangs just after detecting TSC as a timesource. It's busy-waiting (both cpus pegged). Vcpu 0 is in a function called hrtimer_run_queues, vcpu 1 is in a function called do_timer. Xentrace reports that vcpu 1 has an interrupt pending, but that it's not being delivered because interrupts are disabled in the vcpu's eflags.

I've even tried going back to an earlier disk snapshot and booting a different kernel (2.6.18-4-686), just to make sure it's not something dumb like a corrupt VM image.

Anyone else had this problem? Can anyone ATM successfully boot a multi-processor HVM guest of any kind?

I'm going to build and install a kernel that I have the source for, so I can see whether the guest thinks interrupts should be enabled or not, but I'd appreciate any other ideas / suggestions people have to help diagnose the problem...

-George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Gianni Tedesco
2010-Jul-02 14:00 UTC
Re: [Xen-devel] Multi-vcpu HVM Linux domain hanging during boot
On Fri, 2010-07-02 at 14:40 +0100, George Dunlap wrote:
> I've got an HVM Linux guest, Debian 2.6.18-6-686 kernel, which works
> fine if vcpus=1 but hangs if vcpus=2.
>
> I'm pretty sure that it worked with vcpus=2 earlier this week, but now
> I seem unable to find a hypervisor/tools/qemu combination within the
> last month that works.
>
> It hangs just after detecting TSC as a timesource. It's busy-waiting
> (both cpus pegged). Vcpu 0 is in a function called
> hrtimer_run_queues, vcpu1 is in a function called do_timer. Xentrace
> reports that vcpu 1 has an interrupt pending, but that it's not being
> delivered because interrupts are disabled in the vcpu's eflags.
>
> I've even tried going back to an earlier disk snapshot and booting a
> different kernel (2.6.18-4-686), just to make sure it's not something
> dumb like a corrupt VM image.
>
> Anyone else had this problem? Can anyone ATM successfully boot a
> multi-processor HVM guest of any kind?

I've had very similar problems booting Linux HVM guests. I am using RHEL5.x to reproduce. However, in my case EFLAGS has the IF flag set, and interrupts are being delivered from the 8254 PIT as well as RESCHED IPIs. The hangs also happen a lot later, for example in early userspace while mounting filesystems. But as for you, this only occurs with SMP.

I take it that in your case the machine doesn't "unhang" if you hit a key?

Have you tried clocksource=hpet disable_8254_timer on the kernel command line? Some tweaking of those params at least alters the timing of the hangs for me.

Gianni
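
For anyone trying the boot parameters suggested above, here is a minimal Python sketch of appending them to a GRUB-legacy kernel line without duplicating options that are already there. The helper name add_boot_params and the example kernel path and root device are assumptions made for illustration; they are not taken from either guest in this thread.

```python
def add_boot_params(kernel_line, params):
    """Return kernel_line with each entry of params appended at most once.

    A parameter counts as already present when an existing token has the
    same key (the part before '='), so an existing clocksource=... setting
    is left alone rather than duplicated or overridden.
    """
    tokens = kernel_line.split()
    present = {tok.split("=", 1)[0] for tok in tokens}
    for param in params:
        key = param.split("=", 1)[0]
        if key not in present:
            tokens.append(param)
            present.add(key)
    return " ".join(tokens)


# Hypothetical GRUB-legacy kernel line; the path and root device are made up.
line = "kernel /boot/vmlinuz-2.6.18-6-686 root=/dev/hda1 ro"
new_line = add_boot_params(line, ["clocksource=hpet", "disable_8254_timer"])
print(new_line)
```

Running the helper a second time on its own output leaves the line unchanged, which makes it safe to apply repeatedly while experimenting with different parameter sets.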
Dan Magenheimer
2010-Jul-02 16:33 UTC
RE: [Xen-devel] Multi-vcpu HVM Linux domain hanging during boot
> I've got an HVM Linux guest, Debian 2.6.18-6-686 kernel, which works
> fine if vcpus=1 but hangs if vcpus=2.
>
> I'm pretty sure that it worked with vcpus=2 earlier this week, but now
> I seem unable to find a hypervisor/tools/qemu combination within the
> last month that works.
>
> It hangs just after detecting TSC as a timesource. It's busy-waiting
> (both cpus pegged). Vcpu 0 is in a function called
> hrtimer_run_queues, vcpu1 is in a function called do_timer. Xentrace
> reports that vcpu 1 has an interrupt pending, but that it's not being
> delivered because interrupts are disabled in the vcpu's eflags.
>
> I've even tried going back to an earlier disk snapshot and booting a
> different kernel (2.6.18-4-686), just to make sure it's not something
> dumb like a corrupt VM image.
>
> Anyone else had this problem? Can anyone ATM successfully boot a
> multi-processor HVM guest of any kind?
>
> I'm going to build and install a kernel that I have the source for, so
> I can see whether the guest thinks interrupts should be enabled or
> not, but I'd appreciate any other ideas / suggestions people have to
> help diagnose the problem...

Hi George --

This may be a case of "when one has a hammer, everything looks like a nail" but given that it hangs just after detecting TSC as a timesource and works fine with vcpus=1, I have to suspect TSC synchronization issues.

Have you hard-rebooted the physical hardware recently? If not (DON'T YET), can you record the "s" and "t" debug keys from a console or from "xm debug-key" in dom0 (while the VM is in a hung state)? Ideally, try the "s" key many times to get enough samples to be statistically valid in case the sync is swinging wildly.

If you are using xen-4.0.0 or later, all of the tsc_mode work was supposed to resolve this kind of issue, but yours might identify a corner-case... or perhaps the tools you are using (or the vm config file) are overriding the default tsc_mode for the VM?
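
To judge whether repeated "s" samples are "swinging wildly", a small summary of the values can help. This is a minimal sketch, assuming the per-sample figure has already been copied by hand out of the console output into a list of numbers. The function name summarize_samples and the sample values are hypothetical; the numbers are placeholders, not measurements from any machine.

```python
import statistics


def summarize_samples(samples):
    """Summarize repeated readings of one quantity (units as recorded)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
        "range": max(samples) - min(samples),
    }


# Placeholder values for illustration only, not real measurements.
example = [10, 12, 11, 13, 14]
summary = summarize_samples(example)
print(summary)
```

A range or standard deviation that is large compared with the mean would suggest the figure is unstable from sample to sample rather than merely offset by a constant.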
If my hammer is wrong and this has nothing to do with TSC, there were some known issues in "old" kernels under HVM that were resolved by a different timer_mode setting (note this is different from tsc_mode) and/or kernel boot parameters (which sadly differ sometimes even between update releases in the 2.6.18 stream). I can try to resurrect that data if necessary.

Thanks,
Dan

P.S. You didn't mention the version/changeset of Xen.
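
The tsc_mode and timer_mode settings mentioned above both live in the domain config file, which xm reads as Python. Here is a minimal fragment, assuming an xm-style HVM config on Xen 4.0. The values are placeholders showing where the options go, not a recommended fix for this hang.

```python
# Fragment of an xm domain config file (xm config files use Python syntax).
# Values are placeholders for illustration, not a recommendation.

vcpus = 2  # the hang in this thread only appears with more than one vcpu

# tsc_mode: 0 = default, 1 = always emulate, 2 = never emulate, 3 = PVRDTSCP
tsc_mode = 0

# timer_mode (HVM only, distinct from tsc_mode):
#   0 = delay_for_missed_ticks, 1 = no_delay_for_missed_ticks,
#   2 = no_missed_ticks_pending, 3 = one_missed_tick_pending
timer_mode = 1
```

Checking the existing config for either line is a quick way to see whether the default tsc_mode is being overridden for the VM.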