thr3ads.net - xen discuss - nvidia driver incompatible with snv

Jürgen Keil

2007-Sep-24 15:31 UTC

nvidia driver incompatible with snv_75 xen putback?

I've bfu'ed two opensolaris boxes to current opensolaris bits, and
on both boxes I've observed hangs when using xen domUs.

On my ASUS M2NPV-VM box (amd64 M2), I can run 
- a pv OpenSolaris domU
- a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
   through ssh -X connection)

But the same hvm OpenSolaris domU hangs the system when the gui is
displayed on the local Xorg server, sooner or later.

On the other box, M2N-SLI deluxe (amd64 M2), I had at least
one hang, too.

Unfortunately, my attempts to get a crash dump for the hanging
system have failed so far.


Both boxes are using nvidia video hardware and the 1.0-9755 nvidia
driver.


Now I found bug 6607517, which seems to mention that old nvidia
drivers may not be compatible with Solaris xVM (xen dom0).  Is that
the case?  Is the latest Solaris driver (Version: 100.14.19) available
for download from the nvidia website OK?

Bug ID   	 6607517
Synopsis 	Gnome hung sometimes on dom0
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6607517
 
 
This message posted from opensolaris.org

Mark Johnson

2007-Sep-24 20:59 UTC

Re: nvidia driver incompatible with snv_75 xen putback?

Jürgen Keil wrote:
> I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> on both boxes I've observed hangs when using xen domUs.
> 
> On my ASUS M2NPV-VM box (amd64 M2), I can run 
> - a pv OpenSolaris domU
> - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
>    through ssh -X connection)
> 
> But the same hvm OpenSolaris domU hangs the system when the gui is
> displayed on the local Xorg server, sooner or later.
> 
> On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> one hang, too.
> 
> Unfortunately, my attempts to get a crash dump for the hanging
> system have failed so far.
> 
> Both boxes are using nvidia video hardware and the 1.0-9755 nvidia
> driver.
We have some changes to the nvidia driver which you need. You
need to use the packages from the iso (or any solaris >= b73 iso).
> 
> Now I found bug 6607517, which seems to mention that old nvidia
> drivers may not be compatible with Solaris xVM (xen dom0).  Is that
> the case?  Is the latest Solaris driver (Version: 100.14.19) available
> for download from the nvidia website OK?
No, the changes haven't made it into the downloadable drivers
yet.

On a side note, we are working on getting 100.14.x integrated
(with the xVM fixes) into Solaris Express (for those with the
latest and greatest Adapters).



Thanks,

MRJ




> Bug ID   	 6607517
> Synopsis 	Gnome hung sometimes on dom0
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6607517
>  
>  
> This message posted from opensolaris.org
> _______________________________________________
> xen-discuss mailing list
> xen-discuss@opensolaris.org
-- 
Mark Johnson <mark.johnson@sun.com>
Sun Microsystems, Inc.
(781) 442-0869

Juergen Keil

2007-Sep-27 11:09 UTC

Re: nvidia driver incompatible with snv_75 xen putback?

[ not sure if the CC to <matrix-eng@sun.com> works from outside Sun,
  I'm adding xen-discuss@opensolaris.org ... ]

Mark Johnson wrote:
> John Martin wrote:
> > Mark Johnson wrote:
> >>
> >> alpha console login: Sep 26 08:42:09 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc0
> >> Sep 26 08:42:18 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc1
> >> Sep 26 08:42:23 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc2
> >> Sep 26 08:42:27 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001bdc3
> >> Sep 26 08:42:59 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39c
> >> Sep 26 08:43:09 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39d
> >> Sep 26 08:43:13 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39e
> >> Sep 26 08:43:17 alpha nvidia: NVRM: Xid (0001:00): 16, Head 00000000 Count 0001c39f
> >>
> > These messages indicate the driver stopped receiving vertical blank 
> > interrupts from
> > the GPU.  It could either be a legacy interrupt routing problem or a 
> > driver bug, but
> > without more context/information it is hard to say which.
> 
> OK, thanks.
> 
> 
> I think we have identified the problem. It looks like the
> NetBSD domU is triggering the following bug.
>      http://bugs.opensolaris.org/view_bug.do?bug_id=6606864
> 
> 
> 
> Breaking into the xen console and doing a ctrl-a,ctrl-a,ctrl-a,q,
> we're always in the following thread stack...
> 
> [0]> $c
> kmdb_enter+0xb()
> debug_enter+0x37(fffffffffbc0d470)
> xen_debug_handler+0x1f(0)
> av_dispatch_autovect+0x78(103)
> dispatch_hilevel+0x1f(103, 0)
> switch_sp_and_call+0x13()
> do_interrupt+0xd6(ffffff0003ebf710, deadbeef)
> xen_callback_handler+0x370(ffffff0003ebf710, deadbeef)
> xen_callback+0xcd()
> intr_restore+0x76()
> mutex_vector_exit+0xf1(fffffffffbc14210)
> xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880d30)
> xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880d30)
> siron_poke_cpu+0x60(2)
> softcall_choose_cpu+0xd5()
> softcall+0x10a(fffffffffb962240, ffffff0147dad000)
> callout_schedule_1+0xdf(ffffff0147dad000)
> callout_schedule+0x40()
> clock+0x517()
> cyclic_softint+0xc9(fffffffffbc40b90, 1)
> cbe_softclock+0x1a()
> av_dispatch_softvect+0x5f(a)
> dispatch_softint+0x38(0, 0)
> switch_sp_and_call+0x13()
> dosoftint+0x59(ffffff0004da67a0)
> do_interrupt+0x103(ffffff0004da67a0, 1)
> _interrupt+0xc0()
> fakesoftint_return()
> disp_lock_exit+0x56(fffffffffbcfe6d8)
> cv_signal+0x91(ffffff0150cfb58a)
> pollnotify+0x3c(ffffff0150cfb550, f)
> pollwakeup+0x10b(ffffff014ca5c808, 41)
> ip`tcp_fuse_output+0x73f(ffffff015137f980, ffffff014d76c800, 14)
> ip`tcp_output+0x86(ffffff015137f780, ffffff014d76c800, ffffff014958ae40)
> ip`squeue_enter+0x1fb(ffffff014958ae40, ffffff014d76c800, fffffffff7a7e260,
> ffffff015137f780, 7)
> ip`tcp_wput+0xc4(ffffff01511b4398, ffffff014d76c800)
> sockfs`sostream_direct+0x113(ffffff01511b9ad8, ffffff0004da6e20, 0,
> ffffff014e337978)
> sockfs`socktpi_write+0x157(ffffff01511b5500, ffffff0004da6e20, 0,
> ffffff014e337978, 0)
> fop_write+0x69(ffffff01511b5500, ffffff0004da6e20, 0, ffffff014e337978, 0)
> write+0x2ac(8, 80bf270, 14)
> write32+0x1e(8, 80bf270, 14)
> sys_syscall32+0x141()

Yes, on my other AMD64 X2 box I've also been able to get into the
Solaris kernel debugger using the serial port xen console, and I was
able to force some crash dumps.


The stack backtrace I got looks similar:
> $c
debug_enter+0x37(fffffffffbc0d750)
xen_debug_handler+0x1f(0)
av_dispatch_autovect+0x78(103)
dispatch_hilevel+0x1f(103, 0)
switch_sp_and_call+0x13()
do_interrupt+0xd6(ffffff00038bf710, 1)
xen_callback_handler+0x370(ffffff00038bf710, 1)
xen_callback+0xcd()
intr_restore+0x76()
mutex_vector_exit+0xf1(fffffffffbc14510)
xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
siron_poke_cpu+0x60(2)
softcall_choose_cpu+0xd5()
softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
callout_schedule_1+0xdf(ffffff0132b2f000)
callout_schedule+0x40()
clock+0x51b()
cyclic_softint+0xc9(fffffffffbc40e90, 1)
cbe_softclock+0x1a()
av_dispatch_softvect+0x5f(a)
dispatch_softint+0x38(1, 0)
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff00038c5b00)
do_interrupt+0xf9(ffffff00038c5b00, 1)
xen_callback_handler+0x370(ffffff00038c5b00, 1)
xen_callback+0xcd()
sti+0x33()
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff0003805ae0)
do_interrupt+0xf9(ffffff0003805ae0, 1)
xen_callback_handler+0x370(ffffff0003805ae0, 1)
xen_callback+0xcd()
HYPERVISOR_sched_op+0x29(1, 0)
HYPERVISOR_block+0x11()
mach_cpu_idle+0x52()
cpu_idle+0xcc()
idle+0x10e()
thread_start+8()
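
To back up the "looks similar" observation, the two `$c` listings can be compared mechanically. The following is only a rough sketch, not part of the original analysis: it assumes frames have the `name+offset(args)` shape shown above, and the helper names `frame_names` and `longest_shared_run` are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Matches an mdb stack frame such as "xc_call+0x2b(0, 0, ...)" or
# "ip`tcp_output+0x86(...)", optionally preceded by a 16-digit frame
# pointer (the ::findstack -v layout). Only the function name is captured.
FRAME_RE = re.compile(
    r"^(?:[0-9a-f]{16}\s+)?((?:\w+`)?[A-Za-z_]\w*)(?:\+(?:0x)?[0-9a-f]+)?\("
)


def frame_names(trace):
    """Return the function names of all frames found in an mdb stack listing."""
    names = []
    for line in trace.splitlines():
        # Drop indentation and mail quoting ("> ") before matching.
        line = line.strip().lstrip("> ")
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def longest_shared_run(trace_a, trace_b):
    """Return the longest run of consecutive frames common to both traces."""
    a, b = frame_names(trace_a), frame_names(trace_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]
```

Feeding it the quoted stack and the one above should report the shared `debug_enter` ... `dispatch_softint` sequence; argument values and offsets such as `clock+0x517` versus `clock+0x51b` are deliberately ignored, so the comparison is only as strong as the function names.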




It seems that CPU#0 is executing that thread.
Is that an interrupt thread, because of PRI == 109?
It seems the thread on CPU#0 hasn't released the cpu for 1285 clock
ticks (SWITCH is t-1285, = 12.85 seconds) - I guess that is a problem?

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5bb60  1b    7    0 109   no    no t-1285 ffffff00038bfc80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038bfc80
             EXISTS         |           1 ffffff00038c5c80
             ENABLE         |           - ffffff0003805c80 (idle)
                            |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00038d7c80 sched
                                   60 ffffff000403dc80 sched
                                   60 ffffff0003811c80 sched
                                   59 ffffff0135bf8720 sendmail
                                   59 ffffff01347aa400 syseventd
                                   59 ffffff0135c00700 nscd
                                   59 ffffff013f77b1a0 nscd

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffff0134560000  1f    2    0  -1   no    no t-0    ffffff0003be5c80 (idle)
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 ffffff0004478c80 sched
           QUIESCED                60 ffffff000399dc80 sched
             EXISTS         
             ENABLE         
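
The tick arithmetic for the SWITCH column can be reproduced with a few lines of Python. This is a sketch under one assumption: the kernel clock runs at the default 100 ticks per second, which is what the 12.85 seconds figure implies. The function names are illustrative only.

```python
import re

# Assumption: default Solaris clock rate of 100 ticks per second (hz=100).
DEFAULT_HZ = 100

# The SWITCH column of ::cpuinfo prints "t-N": N ticks since the last switch.
SWITCH_RE = re.compile(r"\bt-(\d+)\b")


def ticks_to_seconds(ticks, hz=DEFAULT_HZ):
    """Convert a clock tick count to seconds for the given tick rate."""
    return ticks / hz


def switch_ages(cpuinfo_output, hz=DEFAULT_HZ):
    """Return the seconds since the last switch for every 't-N' field found."""
    return [ticks_to_seconds(int(n), hz) for n in SWITCH_RE.findall(cpuinfo_output)]
```

With the output above, CPU 0 (`t-1285`) comes out at 12.85 seconds and CPU 1 (`t-0`) at 0 seconds. If the box had been booted with a higher tick rate, the `hz` argument would need to change accordingly.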

> ffffff00038bfc80::findstack -v
stack pointer for thread ffffff00038bfc80: ffffff00038bf520
  ffffff00038bf710 0xb()
  ffffff00038bf880 intr_restore+0x76()
  ffffff00038bf8b0 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff00038bf930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
  ffffff00038bf980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
  ffffff00038bf9b0 siron_poke_cpu+0x60(2)
  ffffff00038bf9d0 softcall_choose_cpu+0xd5()
  ffffff00038bfa20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
  ffffff00038bfa60 callout_schedule_1+0xdf(ffffff0132b2f000)
  ffffff00038bfa90 callout_schedule+0x40()
  ffffff00038bfb20 clock+0x51b()
  ffffff00038bfbd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
  ffffff00038bfbe0 cbe_softclock+0x1a()
  ffffff00038bfc30 av_dispatch_softvect+0x5f(a)
  ffffff00038bfc60 dispatch_softint+0x38(1, 0)
  ffffff00038c59b0 switch_sp_and_call+0x13()
  ffffff00038c59f0 dosoftint+0x59(ffffff00038c5b00)
  ffffff00038c5a40 do_interrupt+0xf9(ffffff00038c5b00, 1)
  ffffff00038c5af0 xen_callback_handler+0x370(ffffff00038c5b00, 1)
  ffffff00038c5b00 xen_callback+0xcd()
  ffffff00038c5c60 sti+0x33()
  ffffff0003805990 switch_sp_and_call+0x13()
  ffffff00038059d0 dosoftint+0x59(ffffff0003805ae0)
  ffffff0003805a20 do_interrupt+0xf9(ffffff0003805ae0, 1)
  ffffff0003805ad0 xen_callback_handler+0x370(ffffff0003805ae0, 1)
  ffffff0003805ae0 xen_callback+0xcd()
  ffffff0003805be0 HYPERVISOR_sched_op+0x29(1, 0)
  ffffff0003805bf0 HYPERVISOR_block+0x11()
  ffffff0003805c10 mach_cpu_idle+0x52()
  ffffff0003805c40 cpu_idle+0xcc()
  ffffff0003805c60 idle+0x10e()
  ffffff0003805c70 thread_start+8()



I've also noticed that sampling cpu register dumps on the serial port
xen console using the "d" command frequently showed cpu0 executing
code at "svm_stgi_label" (this is from memory, I hope I remembered
the correct symbol name).

That symbol can be found in the hypervisor source code at:
xen-src-3.0.4-1-sun/xen.hg/xen/arch/x86/hvm/svm/x86_64/exits.S

Is it possible that we're somehow looping in CPU#0 inside the
hypervisor, so that cpu#0 inside the Solaris dom0 doesn't make any
progress?


==========================================================================
I also got a second crash dump, where cpu#0 seems to be stuck; but in this
case kmdb was active on cpu#1 and couldn't switch to cpu#0.

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc40e90  1f    1    0 109  yes    no t-4289 ffffff00038c5c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038c5c80
           QUIESCED         |           - ffffff01453f5b40 python2.4
             EXISTS         |
             ENABLE         +-->  PRI THREAD           PROC
                                   60 ffffff0003a0fc80 sched

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 fffffffffbc5bb60  1b    3    0 109   no    no t-0    ffffff0003c21c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff0003c21c80
             EXISTS         |           - ffffff0003bebc80 (idle)
             ENABLE         |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00038e3c80 sched
                                   60 ffffff000497ec80 sched
                                   60 ffffff000399dc80 sched

> ffffff00038c5c80::findstack -v
stack pointer for thread ffffff00038c5c80: ffffff00038c57f0

> ffffff00038c5c80::whatis
ffffff00038c5c80 is ffffff00038c5c80+0, allocated as a thread structure

> ffffff00038c5c80::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
ffffff00038c5c80 onproc      9    0    3   109     0  10 ffffff01453f5b40

> ffffff01453f5b40::findstack -v
stack pointer for thread ffffff01453f5b40: ffffff0004858af0
[ ffffff0004858af0 _resume_from_idle+0xfa() ]
  ffffff0004858c60 intr_restore+0x76()
  ffffff0004858c90 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff0004858d10 xc_do_call+0x10d(fffffffffb80eef0, 0, 0, 2, ffffffffffffffff, fffffffffb80ee20, 1)
  ffffff0004858d60 xc_sync+0x2b(fffffffffb80eef0, 0, 0, 2, ffffffffffffffff, fffffffffb80ee20)
  ffffff0004858db0 dtrace_xcall+0x6f(ffffffff, fffffffffb80eef0, 0)
  ffffff0004858dc0 dtrace_sync+0x18()
  ffffff0004858e00 dtrace_helpers_destroy+0x4a()
  ffffff0004858e80 proc_exit+0x19c(1, 0)
  ffffff0004858ea0 exit+0x15(1, 0)
  ffffff0004858ec0 rexit+0x18(0)
  ffffff0004858f10 sys_syscall32+0x141()


==========================================================================
And a third crash dump, but this one looks similar to the first one

> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5bb60  1b    1   10 109   no    no t-6156 ffffff00038c5c80 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00038c5c80
             EXISTS         |
             ENABLE         +-->  PRI THREAD           PROC
                                   60 ffffff0003a0fc80 sched

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 ffffff013455a000  1f    2    0  -1   no    no t-0    ffffff0003bebc80 (idle)
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 ffffff0004434c80 sched
           QUIESCED                60 ffffff000399dc80 sched
             EXISTS         
             ENABLE         

> ffffff00038c5c80::findstack -v
stack pointer for thread ffffff00038c5c80: ffffff00038c5520
  ffffff00038c5710 0xb()
  ffffff00038c5880 intr_restore+0x76()
  ffffff00038c58b0 mutex_vector_exit+0xf1(fffffffffbc14510)
  ffffff00038c5930 xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0, 0)
  ffffff00038c5980 xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
  ffffff00038c59b0 siron_poke_cpu+0x60(2)
  ffffff00038c59d0 softcall_choose_cpu+0xd5()
  ffffff00038c5a20 softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
  ffffff00038c5a60 callout_schedule_1+0xdf(ffffff0132b2f000)
  ffffff00038c5a90 callout_schedule+0x40()
  ffffff00038c5b20 clock+0x51b()
  ffffff00038c5bd0 cyclic_softint+0xc9(fffffffffbc40e90, 1)
  ffffff00038c5be0 cbe_softclock+0x1a()
  ffffff00038c5c30 av_dispatch_softvect+0x5f(a)
  ffffff00038c5c60 dispatch_softint+0x38(1, 0)
  ffffff00038bf8f0 switch_sp_and_call+0x13()
  ffffff00038bf930 dtrace_xpv_getsystime+0x7c()

> $c
av_dispatch_autovect+0x78(100)
dispatch_hilevel+0x1f(100, 0)
switch_sp_and_call+0x13()
do_interrupt+0xd6(ffffff00038c5710, 1)
xen_callback_handler+0x370(ffffff00038c5710, 1)
xen_callback+0xcd()
intr_restore+0x76()
mutex_vector_exit+0xf1(fffffffffbc14510)
xc_do_call+0x10d(0, 0, 0, 1, 2, fffffffffb880fb0)
xc_call+0x2b(0, 0, 0, 1, 2, fffffffffb880fb0)
siron_poke_cpu+0x60(2)
softcall_choose_cpu+0xd5()
softcall+0x10a(fffffffffb962880, ffffff0132b2f000)
callout_schedule_1+0xdf(ffffff0132b2f000)
callout_schedule+0x40()
clock+0x51b()
cyclic_softint+0xc9(fffffffffbc40e90, 1)
cbe_softclock+0x1a()
av_dispatch_softvect+0x5f(a)
dispatch_softint+0x38(1, 0)
switch_sp_and_call+0x13()
dtrace_xpv_getsystime+0x7c()

Mark Johnson

2007-Oct-03 02:42 UTC

Re: nvidia driver incompatible with snv_75 xen putback?

> I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> on both boxes I've observed hangs when using xen domUs.
> 
> On my ASUS M2NPV-VM box (amd64 M2), I can run 
> - a pv OpenSolaris domU
> - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
>    through ssh -X connection)
> 
> But the same hvm OpenSolaris domU hangs the system when the gui is
> displayed on the local Xorg server, sooner or later.
> 
> On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> one hang, too.

Can you try putting the following in /etc/system and see if it solves your
problems:

set softcall_delay=0x100000 
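
After editing /etc/system (the setting only takes effect on the next boot), a quick sanity check that the line parses as intended can look like the sketch below. It is an illustration, not an official tool: it only understands plain `set name=value` lines, treats lines starting with `*` or `#` as comments, and the function name `read_system_tunables` is made up.

```python
def read_system_tunables(text):
    """Parse 'set name=value' lines from /etc/system-style text into a dict.

    Numeric values (decimal or 0x-prefixed hex) are converted to int;
    anything else is kept as the raw string.
    """
    tunables = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "*#":
            continue
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue
        name, sep, value = parts[1].partition("=")
        if not sep:
            continue
        value = value.strip()
        try:
            tunables[name.strip()] = int(value, 0)
        except ValueError:
            tunables[name.strip()] = value
    return tunables
```

For the line above, `read_system_tunables(open("/etc/system").read())["softcall_delay"]` would be expected to give 1048576, the decimal form of 0x100000.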


Thanks,

MRJ
 
 

Jürgen Keil

2007-Oct-04 10:10 UTC

Re: nvidia driver incompatible with snv_75 xen putback?

> > I've bfu'ed two opensolaris boxes to current opensolaris bits, and
> > on both boxes I've observed hangs when using xen domUs.
> > 
> > On my ASUS M2NPV-VM box (amd64 M2), I can run 
> > - a pv OpenSolaris domU
> > - a hvm OpenSolaris domU, displaying to a remote machine (gui tunneled
> >    through ssh -X connection)
> > But the same hvm OpenSolaris domU hangs the system when the gui is
> > displayed on the local Xorg server, sooner or later.
> > 
> > On the other box, M2N-SLI deluxe (amd64 M2), I had at least
> > one hang, too.
> 
> 
> can you try putting the following in /etc/system and
> see if it solves your problems:
> 
> set softcall_delay=0x100000 

Yes, so far there have been no more OpenSolaris dom0 hangs
after setting softcall_delay to 0x100000.
 
 
