Juergen Keil
2007-Sep-28 15:11 UTC
NetBSD 3.1 HVM domU hangs... [was Re: nvidia driver incompatible with snv_75 xen putback?]
I wrote:
> Mark Johnson wrote:
> > I think we have identified the problem.. It looks like the
> > netBSD domU is triggering the following bug.
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6606864
> >
>
> Yes, on my other AMD64 X2 box I've also been able to get into the Solaris
> kernel debugger using the serial port xen console, and I was able to force
> some crash dumps.
>
> The stack backtrace I got looks similar:
>
> > $c
> debug_enter+0x37(fffffffffbc0d750)
> xen_debug_handler+0x1f(0)
> av_dispatch_autovect+0x78(103)
> dispatch_hilevel+0x1f(103, 0)
> switch_sp_and_call+0x13()
> do_interrupt+0xd6(ffffff00038bf710, 1)
> xen_callback_handler+0x370(ffffff00038bf710, 1)
> xen_callback+0xcd()
> intr_restore+0x76()
> mutex_vector_exit+0xf1(fffffffffbc14510)
> xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
> xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
> siron_poke_cpu+0x60(2)
> softcall_choose_cpu+0xd5()
> softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
> callout_schedule_1+0xdf(ffffff0132b2f000)
> callout_schedule+0x40()
> clock+0x51b()
> cyclic_softint+0xc9(fffffffffbc40e90, 1)
> cbe_softclock+0x1a()
> av_dispatch_softvect+0x5f(a)
> dispatch_softint+0x38(1, 0)
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff00038c5b00)
> do_interrupt+0xf9(ffffff00038c5b00, 1)
> xen_callback_handler+0x370(ffffff00038c5b00, 1)
> xen_callback+0xcd()
> sti+0x33()
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff0003805ae0)
> do_interrupt+0xf9(ffffff0003805ae0, 1)
> xen_callback_handler+0x370(ffffff0003805ae0, 1)
> xen_callback+0xcd()
> HYPERVISOR_sched_op+0x29(1, 0)
> HYPERVISOR_block+0x11()
> mach_cpu_idle+0x52()
> cpu_idle+0xcc()
> idle+0x10e()
> thread_start+8()
>
> It seems that CPU#0 is executing that thread.
> Is that an interrupt thread, because of PRI == 109?
> It seems the thread on CPU#0 didn't release the cpu for t-1285 clock
> ticks (= 12.85 seconds) - I guess that is a problem?
> > ::cpuinfo -v
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
>   0 fffffffffbc5bb60  1b    7    0 109   no    no t-1285 ffffff00038bfc80 sched
>                  |      |     |
>        RUNNING <--+      |     +--> PIL THREAD
>          READY           |          10 ffffff00038bfc80
>         EXISTS           |           1 ffffff00038c5c80
>         ENABLE           |           - ffffff0003805c80 (idle)
>                          |
>                          +--> PRI THREAD           PROC
>                                99 ffffff00038d7c80 sched
>                                60 ffffff000403dc80 sched
>                                60 ffffff0003811c80 sched
>                                59 ffffff0135bf8720 sendmail
>                                59 ffffff01347aa400 syseventd
>                                59 ffffff0135c00700 nscd
>                                59 ffffff013f77b1a0 nscd
>
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
>   1 ffffff0134560000  1f    2    0  -1   no    no t-0    ffffff0003be5c80 (idle)
>                  |      |
>        RUNNING <--+      +--> PRI THREAD           PROC
>          READY            60 ffffff0004478c80 sched
>       QUIESCED            60 ffffff000399dc80 sched
>         EXISTS
>         ENABLE
>
> > ffffff00038bfc80::findstack -v
> stack pointer for thread ffffff00038bfc80: ffffff00038bf520
> ffffff00038bf710 0xb()
> ffffff00038bf880 intr_restore+0x76()
> ffffff00038bf8b0 mutex_vector_exit+0xf1(fffffffffbc14510)
> ffffff00038bf930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
> ffffff00038bf980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
> ffffff00038bf9b0 siron_poke_cpu+0x60(2)
> ffffff00038bf9d0 softcall_choose_cpu+0xd5()
> ffffff00038bfa20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
> ffffff00038bfa60 callout_schedule_1+0xdf(ffffff0132b2f000)
> ffffff00038bfa90 callout_schedule+0x40()
> ffffff00038bfb20 clock+0x51b()
> ffffff00038bfbd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
> ffffff00038bfbe0 cbe_softclock+0x1a()
> ffffff00038bfc30 av_dispatch_softvect+0x5f(a)
> ffffff00038bfc60 dispatch_softint+0x38(1, 0)
> ffffff00038c59b0 switch_sp_and_call+0x13()
> ffffff00038c59f0 dosoftint+0x59(ffffff00038c5b00)
> ffffff00038c5a40 do_interrupt+0xf9(ffffff00038c5b00, 1)
> ffffff00038c5af0 xen_callback_handler+0x370(ffffff00038c5b00, 1)
> ffffff00038c5b00 xen_callback+0xcd()
> ffffff00038c5c60 sti+0x33()
> ffffff0003805990 switch_sp_and_call+0x13()
> ffffff00038059d0 dosoftint+0x59(ffffff0003805ae0)
> ffffff0003805a20 do_interrupt+0xf9(ffffff0003805ae0, 1)
> ffffff0003805ad0 xen_callback_handler+0x370(ffffff0003805ae0, 1)
> ffffff0003805ae0 xen_callback+0xcd()
> ffffff0003805be0 HYPERVISOR_sched_op+0x29(1, 0)
> ffffff0003805bf0 HYPERVISOR_block+0x11()
> ffffff0003805c10 mach_cpu_idle+0x52()
> ffffff0003805c40 cpu_idle+0xcc()
> ffffff0003805c60 idle+0x10e()
> ffffff0003805c70 thread_start+8()
>
> I've also noticed that sampling cpu register dumps on the serial port
> xen console using the "d" command was frequently showing cpu0 executing
> code at "svm_stgi_label" (this is from memory, I hope I remembered the
> correct symbol name).
>
> That symbol can be found in the hypervisor source code at:
> xen-src-3.0.4-1-sun/xen.hg/xen/arch/x86/hvm/svm/x86_64/exits.S
>
> Is it possible that we're somehow looping on CPU#0 inside the hypervisor,
> so that cpu#0 inside the Solaris dom0 doesn't make any progress?

I've repeated the NetBSD 3.1 HVM domU boot test under a snv66 dom0 kernel
(which just hangs the netbsd domU but doesn't kill/hang the dom0).  Using
the xentrace / xentrace_format utilities I got the following while the
NetBSD HVM domU was hanging:

...
CPU1 36797688384536 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688388981 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688400048 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688424898 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688429551 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688440865 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688452604 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688463945 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688465681 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688470057 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688492973 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688505979 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688510347 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688521456 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688533272 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688546327 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688550758 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688573613 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688585051 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688586731 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688591120 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688613926 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688626729 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688631062 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688642066 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688653826 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688666521 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688670850 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688681966 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688693894 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688705004 VMEXIT_0  0x00000002 0xc0100e58 0x00000064
CPU1 36797688706579 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688710932 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688722242 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
CPU1 36797688734095 VMEXIT_0  0x00000002 0xc0101a59 0x0000007b
CPU1 36797688746760 VMX_INTR  [ domid = 0x00000002, trap = 0x00000026, va = 0x00000000 ]
CPU1 36797688751112 VMEXIT_0  0x00000002 0xc01019f3 0x0000007b
CPU1 36797688762316 VMEXIT_0  0x00000002 0xc01019f7 0x0000007b
....

Plus this "xm dmesg" console output, with hvm_debug=0x32:

...
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x20 eip=8:c01019f7, exit_qualification = 200010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x20 eip=8:c01019f7, exit_qualification = 200010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c0101a59, exit_qualification = 210010
(XEN) [HVM:4.0] <svm_io_instruction> svm_io_instruction: port 0x21 eip=8:c01019f3, exit_qualification = 210010
...

So it seems the NetBSD domU is receiving lots of floppy controller IRQ#6
interrupts (trap == 0x26) and is accessing the 8259 interrupt controller
(PIC) I/O ports 0x20 and 0x21.  Somehow it seems unable to clear the IRQ#6
condition (and I don't see any floppy controller I/O port accesses at all?).
Apparently this is a known problem with NetBSD under Xen HVM:

http://mail-index.netbsd.org/port-xen/2007/08/29/0002.html
http://mail-index.netbsd.org/port-xen/2007/09/07/0000.html