Juergen Keil
2007-Sep-28 15:11 UTC
NetBSD 3.1 HVM domU hangs... [was Re: nvidia driver incompatible with snv_75 xen putback?]
I wrote:
> Mark Johnson wrote:
> > I think we have identified the problem.. It looks like the
> > netBSD domU is triggering the following bug.
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6606864
> >
>
> Yes, on my other AMD64 X2 box I've also been able to get into the Solaris
> kernel debugger using the serial port xen console, and I was able to force
> some crash dumps.
>
> The stack backtrace I got looks similar:
>
> > $c
> debug_enter+0x37(fffffffffbc0d750)
> xen_debug_handler+0x1f(0)
> av_dispatch_autovect+0x78(103)
> dispatch_hilevel+0x1f(103, 0)
> switch_sp_and_call+0x13()
> do_interrupt+0xd6(ffffff00038bf710, 1)
> xen_callback_handler+0x370(ffffff00038bf710, 1)
> xen_callback+0xcd()
> intr_restore+0x76()
> mutex_vector_exit+0xf1(fffffffffbc14510)
> xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
> xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
> siron_poke_cpu+0x60(2)
> softcall_choose_cpu+0xd5()
> softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
> callout_schedule_1+0xdf(ffffff0132b2f000)
> callout_schedule+0x40()
> clock+0x51b()
> cyclic_softint+0xc9(fffffffffbc40e90, 1)
> cbe_softclock+0x1a()
> av_dispatch_softvect+0x5f(a)
> dispatch_softint+0x38(1, 0)
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff00038c5b00)
> do_interrupt+0xf9(ffffff00038c5b00, 1)
> xen_callback_handler+0x370(ffffff00038c5b00, 1)
> xen_callback+0xcd()
> sti+0x33()
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff0003805ae0)
> do_interrupt+0xf9(ffffff0003805ae0, 1)
> xen_callback_handler+0x370(ffffff0003805ae0, 1)
> xen_callback+0xcd()
> HYPERVISOR_sched_op+0x29(1, 0)
> HYPERVISOR_block+0x11()
> mach_cpu_idle+0x52()
> cpu_idle+0xcc()
> idle+0x10e()
> thread_start+8()
>
> It seems that CPU#0 is executing that thread.
> Is that an interrupt thread, because of PRI == 109?
> It seems the thread on CPU#0 didn't release the cpu for t-1285 clock
> ticks (= 12.85 seconds) - I guess that is a problem?
> > ::cpuinfo -v
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
>   0 fffffffffbc5bb60  1b    7    0 109   no    no t-1285 ffffff00038bfc80 sched
>                  |      |     |
>        RUNNING <--+      |     +--> PIL THREAD
>          READY           |          10 ffffff00038bfc80
>         EXISTS           |           1 ffffff00038c5c80
>         ENABLE           |           - ffffff0003805c80 (idle)
>                          |
>                          +--> PRI THREAD           PROC
>                                99 ffffff00038d7c80 sched
>                                60 ffffff000403dc80 sched
>                                60 ffffff0003811c80 sched
>                                59 ffffff0135bf8720 sendmail
>                                59 ffffff01347aa400 syseventd
>                                59 ffffff0135c00700 nscd
>                                59 ffffff013f77b1a0 nscd
>
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
>   1 ffffff0134560000  1f    2    0  -1   no    no t-0    ffffff0003be5c80 (idle)
>                  |      |
>        RUNNING <--+      +--> PRI THREAD           PROC
>          READY            60 ffffff0004478c80 sched
>       QUIESCED            60 ffffff000399dc80 sched
>         EXISTS
>         ENABLE
>
> > ffffff00038bfc80::findstack -v
> stack pointer for thread ffffff00038bfc80: ffffff00038bf520
> ffffff00038bf710 0xb()
> ffffff00038bf880 intr_restore+0x76()
> ffffff00038bf8b0 mutex_vector_exit+0xf1(fffffffffbc14510)
> ffffff00038bf930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
> ffffff00038bf980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
> ffffff00038bf9b0 siron_poke_cpu+0x60(2)
> ffffff00038bf9d0 softcall_choose_cpu+0xd5()
> ffffff00038bfa20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
> ffffff00038bfa60 callout_schedule_1+0xdf(ffffff0132b2f000)
> ffffff00038bfa90 callout_schedule+0x40()
> ffffff00038bfb20 clock+0x51b()
> ffffff00038bfbd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
> ffffff00038bfbe0 cbe_softclock+0x1a()
> ffffff00038bfc30 av_dispatch_softvect+0x5f(a)
> ffffff00038bfc60 dispatch_softint+0x38(1, 0)
> ffffff00038c59b0 switch_sp_and_call+0x13()
> ffffff00038c59f0 dosoftint+0x59(ffffff00038c5b00)
> ffffff00038c5a40 do_interrupt+0xf9(ffffff00038c5b00, 1)
> ffffff00038c5af0 xen_callback_handler+0x370(ffffff00038c5b00, 1)
> ffffff00038c5b00 xen_callback+0xcd()
> ffffff00038c5c60 sti+0x33()
> ffffff0003805990 switch_sp_and_call+0x13()
> ffffff00038059d0 dosoftint+0x59(ffffff0003805ae0)
> ffffff0003805a20 do_interrupt+0xf9(ffffff0003805ae0, 1)
> ffffff0003805ad0 xen_callback_handler+0x370(ffffff0003805ae0, 1)
> ffffff0003805ae0 xen_callback+0xcd()
> ffffff0003805be0 HYPERVISOR_sched_op+0x29(1, 0)
> ffffff0003805bf0 HYPERVISOR_block+0x11()
> ffffff0003805c10 mach_cpu_idle+0x52()
> ffffff0003805c40 cpu_idle+0xcc()
> ffffff0003805c60 idle+0x10e()
> ffffff0003805c70 thread_start+8()
>
> I've also noticed that sampling cpu register dumps on the serial port
> xen console using the "d" command was frequently showing cpu0 executing
> code at "svm_stgi_label" (this is from memory, I hope I remembered the
> correct symbol name).
>
> That symbol can be found in the hypervisor source code at:
> xen-src-3.0.4-1-sun/xen.hg/xen/arch/x86/hvm/svm/x86_64/exits.S
>
> Is it possible that we're somehow looping on CPU#0 inside the hypervisor,
> so that cpu#0 inside the Solaris dom0 doesn't make any progress?

I've repeated the NetBSD 3.1 HVM domU boot test under a snv66 dom0 kernel
(which just hangs the netbsd domU but doesn't kill/hang the dom0).  Using
the xentrace / xentrace_format utilities I got the following while the
NetBSD HVM domU was hanging:

...
CPU1 36797688384536 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688388981 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688400048 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688424898 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688429551 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688440865 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688452604 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688463945 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688465681 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688470057 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688492973 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688505979 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688510347 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688521456 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688533272 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688546327 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688550758 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688573613 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688585051 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688586731 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688591120 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688613926 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688626729 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688631062 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688642066 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688653826 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688666521 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688670850 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688681966 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688693894 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688705004 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688706579 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688710932 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688722242 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688734095 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688746760 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688751112 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688762316 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
....

Plus this "xm dmesg" console output, with hvm_debug=0x32:

...
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x20 eip=8:c01019f7, exit_qualification = 200010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x20 eip=8:c01019f7, exit_qualification = 200010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
...

So it seems the NetBSD domU is receiving lots of floppy controller IRQ#6
interrupts (trap == 0x26) and is accessing the 8259 interrupt controller
(PIC) I/O ports 0x20 and 0x21.  Somehow it seems unable to clear the IRQ#6
condition (and I don't see any floppy controller I/O port accesses at all?).
Apparently this is a known problem with NetBSD under Xen HVM:

http://mail-index.netbsd.org/port-xen/2007/08/29/0002.html
http://mail-index.netbsd.org/port-xen/2007/09/07/0000.html