I've bfu'ed two opensolaris boxes to current opensolaris bits, and
on both boxes I've observed hangs when using xen domUs.

On my ASUS M2NPV-VM box (amd64 M2), I can run
- a pv OpenSolaris domU
- a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
  through ssh -X connection)

But the same hvm OpenSolaris domU hangs the system when the gui is
displayed on the local Xorg server, sooner or later.

On the other box, M2N-SLI deluxe (amd64 M2), I had at least
one hang, too.

Unfortunately, my attempts to get a crash dump for the hanging
system have failed so far.

Both boxes are using nvidia video hardware and the 1.0-9755 nvidia
driver.

Now I found bug 6607517, which seems to mention that old nvidia
drivers may not be compatible with Solaris xVM (xen dom0). Is that
the case? Is the latest Solaris driver ("Version: 100.14.19") available
for download from the nvidia website OK?

Bug ID 6607517
Synopsis Gnome hung sometimes on dom0
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6607517

This message posted from opensolaris.org
Mark Johnson
2007-Sep-24 20:59 UTC
Re: nvidia driver incompatible with snv_75 xen putback?
Jürgen Keil wrote:
> I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> on both boxes I've observed hangs when using xen domUs.
>
> On my ASUS M2NPV-VM box (amd64 M2), I can run
> - a pv OpenSolaris domU
> - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
>   through ssh -X connection)
>
> But the same hvm OpenSolaris domU hangs the system when the gui is
> displayed on the local Xorg server, sooner or later.
>
> On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> one hang, too.
>
> Unfortunately, my attempts to get a crash dump for the hanging
> system have failed so far.
>
> Both boxes are using nvidia video hardware and the 1.0-9755 nvidia
> driver.

We have some changes to the nvidia driver which you need. You need
to use the packages from the iso (or any solaris >= b73 iso).

> Now I found bug 6607517, which seems to mention that old nvidia
> drivers may not be compatible with Solaris xVM (xen dom0). Is that
> the case? Is the latest Solaris driver ("Version: 100.14.19") available
> for download from the nvidia website OK?

No, the changes haven't made it into the downloadable drivers yet.

On a side note, we are working on getting 100.14.x integrated
(with the xVM fixes) into Solaris Express (for those with the
latest and greatest Adapters).

Thanks,

MRJ

> Bug ID 6607517
> Synopsis Gnome hung sometimes on dom0
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6607517

--
Mark Johnson <mark.johnson@sun.com>
Sun Microsystems, Inc.
(781) 442-0869
Juergen Keil
2007-Sep-27 11:09 UTC
Re: nvidia driver incompatible with snv_75 xen putback?
[ not sure if the CC to <matrix-eng@sun.com> works from outside Sun,
  I'm adding xen-discuss@opensolaris.org ... ]

Mark Johnson wrote:
> John Martin wrote:
> > Mark Johnson wrote:
> >>
> >> alpha console login: Sep 26 08:42:09 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc0
> >> Sep 26 08:42:18 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc1
> >> Sep 26 08:42:23 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc2
> >> Sep 26 08:42:27 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc3
> >> Sep 26 08:42:59 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39c
> >> Sep 26 08:43:09 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39d
> >> Sep 26 08:43:13 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39e
> >> Sep 26 08:43:17 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39f
> >>
> > These messages indicate the driver stopped receiving vertical blank
> > interrupts from the GPU. It could either be a legacy interrupt routing
> > problem or a driver bug, but without more context/information it is
> > hard to say which.
>
> OK, thanks.
>
>
> I think we have identified the problem. It looks like the
> netBSD domU is triggering the following bug.
> http://bugs.opensolaris.org/view_bug.do?bug_id=6606864
>
>
> Breaking into the xen console and doing a ctrl-a,ctrl-a,ctrl-a,q,
> we're always in the following thread stack...
>
> [0]> $c
> kmdb_enter+0xb()
> debug_enter+0x37(fffffffffbc0d470)
> xen_debug_handler+0x1f(0)
> av_dispatch_autovect+0x78(103)
> dispatch_hilevel+0x1f(103, 0)
> switch_sp_and_call+0x13()
> do_interrupt+0xd6(ffffff0003ebf710, deadbeef)
> xen_callback_handler+0x370(ffffff0003ebf710, deadbeef)
> xen_callback+0xcd()
> intr_restore+0x76()
> mutex_vector_exit+0xf1(fffffffffbc14210)
> xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880d30)
> xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880d30)
> siron_poke_cpu+0x60(2)
> softcall_choose_cpu+0xd5()
> softcall+0x10a(fffffffffb962240, ffffff0147dad000)
> callout_schedule_1+0xdf(ffffff0147dad000)
> callout_schedule+0x40()
> clock+0x517()
> cyclic_softint+0xc9(fffffffffbc40b90, 1)
> cbe_softclock+0x1a()
> av_dispatch_softvect+0x5f(a)
> dispatch_softint+0x38(0, 0)
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff0004da67a0)
> do_interrupt+0x103(ffffff0004da67a0, 1)
> _interrupt+0xc0()
> fakesoftint_return()
> disp_lock_exit+0x56(fffffffffbcfe6d8)
> cv_signal+0x91(ffffff0150cfb58a)
> pollnotify+0x3c(ffffff0150cfb550, f)
> pollwakeup+0x10b(ffffff014ca5c808, 41)
> ip`tcp_fuse_output+0x73f(ffffff015137f980, ffffff014d76c800, 14)
> ip`tcp_output+0x86(ffffff015137f780, ffffff014d76c800, ffffff014958ae40)
> ip`squeue_enter+0x1fb(ffffff014958ae40, ffffff014d76c800, fffffffff7a7e260, ffffff015137f780, 7)
> ip`tcp_wput+0xc4(ffffff01511b4398, ffffff014d76c800)
> sockfs`sostream_direct+0x113(ffffff01511b9ad8, ffffff0004da6e20, 0, ffffff014e337978)
> sockfs`socktpi_write+0x157(ffffff01511b5500, ffffff0004da6e20, 0, ffffff014e337978, 0)
> fop_write+0x69(ffffff01511b5500, ffffff0004da6e20, 0, ffffff014e337978, 0)
> write+0x2ac(8, 80bf270, 14)
> write32+0x1e(8, 80bf270, 14)
> sys_syscall32+0x141()

Yes, on my other AMD64 X2 box I've also been able to get into the
Solaris kernel debugger using the serial port xen console, and I
was able to force some crash dumps.
The stack backtrace I got looks similar:

> $c
debug_enter+0x37(fffffffffbc0d750)
xen_debug_handler+0x1f(0)
av_dispatch_autovect+0x78(103)
dispatch_hilevel+0x1f(103, 0)
switch_sp_and_call+0x13()
do_interrupt+0xd6(ffffff00038bf710, 1)
xen_callback_handler+0x370(ffffff00038bf710, 1)
xen_callback+0xcd()
intr_restore+0x76()
mutex_vector_exit+0xf1(fffffffffbc14510)
xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
siron_poke_cpu+0x60(2)
softcall_choose_cpu+0xd5()
softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
callout_schedule_1+0xdf(ffffff0132b2f000)
callout_schedule+0x40()
clock+0x51b()
cyclic_softint+0xc9(fffffffffbc40e90, 1)
cbe_softclock+0x1a()
av_dispatch_softvect+0x5f(a)
dispatch_softint+0x38(1, 0)
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff00038c5b00)
do_interrupt+0xf9(ffffff00038c5b00, 1)
xen_callback_handler+0x370(ffffff00038c5b00, 1)
xen_callback+0xcd()
sti+0x33()
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff0003805ae0)
do_interrupt+0xf9(ffffff0003805ae0, 1)
xen_callback_handler+0x370(ffffff0003805ae0, 1)
xen_callback+0xcd()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()

It seems that CPU#0 is executing that thread. Is that an interrupt
thread, because of PRI == 109?
It seems the thread on CPU#0 didn't release the cpu for t-1285 clock
ticks (= 12.85 seconds) - I guess that is a problem?

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5bb60  1b    7    0 109   no    no t-1285 ffffff00038bfc80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038bfc80
             EXISTS         |           1 ffffff00038c5c80
             ENABLE         |           - ffffff0003805c80 (idle)
                            |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00038d7c80 sched
                                   60 ffffff000403dc80 sched
                                   60 ffffff0003811c80 sched
                                   59 ffffff0135bf8720 sendmail
                                   59 ffffff01347aa400 syseventd
                                   59 ffffff0135c00700 nscd
                                   59 ffffff013f77b1a0 nscd

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffff0134560000  1f    2    0  -1   no    no    t-0 ffffff0003be5c80 (idle)
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 ffffff0004478c80 sched
           QUIESCED                60 ffffff000399dc80 sched
             EXISTS
             ENABLE

> ffffff00038bfc80::findstack -v
stack pointer for thread ffffff00038bfc80: ffffff00038bf520
  ffffff00038bf710 0xb()
  ffffff00038bf880 intr_restore+0x76()
  ffffff00038bf8b0 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff00038bf930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
  ffffff00038bf980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
  ffffff00038bf9b0 siron_poke_cpu+0x60(2)
  ffffff00038bf9d0 softcall_choose_cpu+0xd5()
  ffffff00038bfa20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
  ffffff00038bfa60 callout_schedule_1+0xdf(ffffff0132b2f000)
  ffffff00038bfa90 callout_schedule+0x40()
  ffffff00038bfb20 clock+0x51b()
  ffffff00038bfbd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
  ffffff00038bfbe0 cbe_softclock+0x1a()
  ffffff00038bfc30 av_dispatch_softvect+0x5f(a)
  ffffff00038bfc60 dispatch_softint+0x38(1, 0)
  ffffff00038c59b0 switch_sp_and_call+0x13()
  ffffff00038c59f0 dosoftint+0x59(ffffff00038c5b00)
  ffffff00038c5a40 do_interrupt+0xf9(ffffff00038c5b00, 1)
  ffffff00038c5af0 xen_callback_handler+0x370(ffffff00038c5b00, 1)
  ffffff00038c5b00 xen_callback+0xcd()
  ffffff00038c5c60 sti+0x33()
  ffffff0003805990 switch_sp_and_call+0x13()
  ffffff00038059d0 dosoftint+0x59(ffffff0003805ae0)
  ffffff0003805a20 do_interrupt+0xf9(ffffff0003805ae0, 1)
  ffffff0003805ad0 xen_callback_handler+0x370(ffffff0003805ae0, 1)
  ffffff0003805ae0 xen_callback+0xcd()
  ffffff0003805be0 HYPERVISOR_sched_op+0x29(1, 0)
  ffffff0003805bf0 HYPERVISOR_block+0x11()
  ffffff0003805c10 mach_cpu_idle+0x52()
  ffffff0003805c40 cpu_idle+0xcc()
  ffffff0003805c60 idle+0x10e()
  ffffff0003805c70 thread_start+8()

I've also noticed that sampling cpu register dumps on the serial port
xen console using the "d" command frequently was showing cpu0 executing
code at "svm_stgi_label" (this is from memory, I hope I did remember the
correct symbol name). That symbol can be found in the hypervisor source
code at:

  xen-src-3.0.4-1-sun/xen.hg/xen/arch/x86/hvm/svm/x86_64/exits.S

Is it possible that we're somehow looping in CPU#0 inside the
hypervisor? So that cpu#0 inside the Solaris dom0 doesn't make any
progress?

==========================================================================

I also got a second crash dump, where cpu#0 seems to be stuck; but in
this case kmdb was active on cpu#1 and couldn't switch to cpu#0.

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc40e90  1f    1    0 109  yes    no t-4289 ffffff00038c5c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038c5c80
           QUIESCED         |           - ffffff01453f5b40 python2.4
             EXISTS         |
             ENABLE         +-->  PRI THREAD           PROC
                                   60 ffffff0003a0fc80 sched

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 fffffffffbc5bb60  1b    3    0 109   no    no    t-0 ffffff0003c21c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff0003c21c80
             EXISTS         |           - ffffff0003bebc80 (idle)
             ENABLE         |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00038e3c80 sched
                                   60 ffffff000497ec80 sched
                                   60 ffffff000399dc80 sched

> ffffff00038c5c80::findstack -v
stack pointer for thread ffffff00038c5c80: ffffff00038c57f0

> ffffff00038c5c80::whatis
ffffff00038c5c80 is ffffff00038c5c80+0, allocated as a thread structure

> ffffff00038c5c80::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
ffffff00038c5c80 onproc      9    0    3   109     0  10 ffffff01453f5b40

> ffffff01453f5b40::findstack -v
stack pointer for thread ffffff01453f5b40: ffffff0004858af0
[ ffffff0004858af0 _resume_from_idle+0xfa() ]
  ffffff0004858c60 intr_restore+0x76()
  ffffff0004858c90 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff0004858d10 xc_do_call+0x10d(fffffffffb80eef0, 0, 0, 2, ffffffffffffffff, fffffffffb80ee20, 1)
  ffffff0004858d60 xc_sync+0x2b(fffffffffb80eef0, 0, 0, 2, ffffffffffffffff, fffffffffb80ee20)
  ffffff0004858db0 dtrace_xcall+0x6f(ffffffff, fffffffffb80eef0, 0)
  ffffff0004858dc0 dtrace_sync+0x18()
  ffffff0004858e00 dtrace_helpers_destroy+0x4a()
  ffffff0004858e80 proc_exit+0x19c(1, 0)
  ffffff0004858ea0 exit+0x15(1, 0)
  ffffff0004858ec0 rexit+0x18(0)
  ffffff0004858f10 sys_syscall32+0x141()

==========================================================================

And a third crash dump, but this one looks similar to the first one:

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5bb60  1b    1   10 109   no    no t-6156 ffffff00038c5c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038c5c80
             EXISTS         |
             ENABLE         +-->  PRI THREAD           PROC
                                   60 ffffff0003a0fc80 sched

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffff013455a000  1f    2    0  -1   no    no    t-0 ffffff0003bebc80 (idle)
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 ffffff0004434c80 sched
           QUIESCED                60 ffffff000399dc80 sched
             EXISTS
             ENABLE

> ffffff00038c5c80::findstack -v
stack pointer for thread ffffff00038c5c80: ffffff00038c5520
  ffffff00038c5710 0xb()
  ffffff00038c5880 intr_restore+0x76()
  ffffff00038c58b0 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff00038c5930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
  ffffff00038c5980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
  ffffff00038c59b0 siron_poke_cpu+0x60(2)
  ffffff00038c59d0 softcall_choose_cpu+0xd5()
  ffffff00038c5a20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
  ffffff00038c5a60 callout_schedule_1+0xdf(ffffff0132b2f000)
  ffffff00038c5a90 callout_schedule+0x40()
  ffffff00038c5b20 clock+0x51b()
  ffffff00038c5bd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
  ffffff00038c5be0 cbe_softclock+0x1a()
  ffffff00038c5c30 av_dispatch_softvect+0x5f(a)
  ffffff00038c5c60 dispatch_softint+0x38(1, 0)
  ffffff00038bf8f0 switch_sp_and_call+0x13()
  ffffff00038bf930 dtrace_xpv_getsystime+0x7c()

> $c
av_dispatch_autovect+0x78(100)
dispatch_hilevel+0x1f(100, 0)
switch_sp_and_call+0x13()
do_interrupt+0xd6(ffffff00038c5710, 1)
xen_callback_handler+0x370(ffffff00038c5710, 1)
xen_callback+0xcd()
intr_restore+0x76()
mutex_vector_exit+0xf1(fffffffffbc14510)
xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
siron_poke_cpu+0x60(2)
softcall_choose_cpu+0xd5()
softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
callout_schedule_1+0xdf(ffffff0132b2f000)
callout_schedule+0x40()
clock+0x51b()
cyclic_softint+0xc9(fffffffffbc40e90, 1)
cbe_softclock+0x1a()
av_dispatch_softvect+0x5f(a)
dispatch_softint+0x38(1, 0)
switch_sp_and_call+0x13()
dtrace_xpv_getsystime+0x7c()
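
For reference, the SWITCH values for CPU#0 in the three dumps above (t-1285, t-4289, t-6156) can be turned into wall-clock time the same way the message does for the first one. The snippet below is only a sketch: it assumes the dom0 clock runs at 100 ticks per second, which is what the "1285 ticks = 12.85 seconds" figure implies, and the helper name `switch_age_seconds` is made up for illustration.

```python
# Assumption: 100 clock ticks per second (hz = 100), as implied by the
# "t-1285 clock ticks (= 12.85 seconds)" remark in the message above.
HZ = 100


def switch_age_seconds(switch_field, hz=HZ):
    """Convert a ::cpuinfo SWITCH value such as 't-1285' to seconds."""
    if not switch_field.startswith("t-"):
        raise ValueError("unexpected SWITCH value: %r" % switch_field)
    return int(switch_field[2:]) / hz


# SWITCH values for CPU#0 copied from the three crash dumps.
dumps = [("first", "t-1285"), ("second", "t-4289"), ("third", "t-6156")]

for name, field in dumps:
    print(name, "dump:", field, "->", switch_age_seconds(field), "seconds")
```

Under that assumption CPU#0 had not switched threads for roughly 13, 43 and 62 seconds respectively, while CPU#1 shows t-0 in every dump.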
Mark Johnson
2007-Oct-03 02:42 UTC
Re: nvidia driver incompatible with snv_75 xen putback?
> I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> on both boxes I've observed hangs when using xen domUs.
>
> On my ASUS M2NPV-VM box (amd64 M2), I can run
> - a pv OpenSolaris domU
> - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
>   through ssh -X connection)
>
> But the same hvm OpenSolaris domU hangs the system when the gui is
> displayed on the local Xorg server, sooner or later.
>
> On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> one hang, too.

Can you try putting the following in /etc/system and see if it
solves your problems:

set softcall_delay=0x100000

Thanks,

MRJ
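
For anyone trying the same workaround, one way to double-check that the line really ended up in /etc/system is to scan the file for `set` directives. The following is a rough sketch only: the helper name `find_tunable` is hypothetical, it handles just the plain `set name=value` form quoted above, and it treats lines starting with `*` or `#` as comments. Note that /etc/system is read at boot, so the setting only takes effect after a reboot.

```python
import re

# Matches the plain "set name=value" form; a "module:name" prefix is allowed.
SET_RE = re.compile(r"^\s*set\s+([\w:]+)\s*=\s*(\S+)")


def find_tunable(text, name):
    """Return the last value assigned to `name` as an int, or None if absent."""
    value = None
    for line in text.splitlines():
        stripped = line.lstrip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "*#":
            continue
        match = SET_RE.match(line)
        if match and match.group(1) == name:
            # Base 0 accepts both 0x... hex and plain decimal values.
            value = int(match.group(2), 0)
    return value


# Sample text standing in for the contents of /etc/system.
sample = """* workaround suggested in this thread
set softcall_delay=0x100000
"""

print(find_tunable(sample, "softcall_delay"))
```

On a real system the sample string would be replaced by the contents of /etc/system, for example `open("/etc/system").read()`.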
> > I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> > on both boxes I've observed hangs when using xen domUs.
> >
> > On my ASUS M2NPV-VM box (amd64 M2), I can run
> > - a pv OpenSolaris domU
> > - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
> >   through ssh -X connection)
> >
> > But the same hvm OpenSolaris domU hangs the system when the gui is
> > displayed on the local Xorg server, sooner or later.
> >
> > On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> > one hang, too.
>
> can you try putting the following in /etc/system and
> see if it solves your problems:
>
> set softcall_delay=0x100000

Yes, so far there have been no more OpenSolaris dom0 hangs
after setting softcall_delay to 0x100000.