thr3ads.net - freebsd stable - Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write [May 2010]

If this information is useful, please help other people find it:
Share via:

Terry Kennedy

2010-May-14 03:37 UTC

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

I'm reposting this over here at the suggestion of the Forums moderator.
The original post is at http://forums.freebsd.org/showthread.php?t=14163

Got an interesting crash just now (well, as interesting as a crash on a 
soon-to-be production system can be).

This is 8-STABLE/amd64, last cvsup'd early in the morning of May 9th.

The system didn't complete the crash dump, so it needed a manual reset to
get
it going again.

The crash was a "page fault while in kernel mode" with the current
process
being the interrupt service routine for the bce0 GigE. Things progressed 
reasonably until partway through the dump, when the system locked up with a 
"Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock".
That's the
same PID as reported in the main crash.

Screen capture at http://www.tmk.com/transient/crash-20100513002317.png
Complete dmesg, etc. available on request.

As I mentioned above, the system needed a hard reset to get going again. 
savecore doesn't think there's a usable dump, so I don't think
there's any
more info to gather.

I just cvsup'd the box and built a new kernel, in case the previous cvsup
was
in between related commits, or to see if anything changed since. I still have 
the old kernel around in case any useful info can be gathered from it.

So, a couple questions:

1) Anything known to be funky w/ bce?

2) Should the part of the system that caused the panic be able to lock up the 
crash dump process? Obviously, if the disk driver causes a panic, all bets are 
off when trying to use it to write the dump, but this crash seems to have been 
from a network driver. Shouldn't a double panic just give up on the dump and
try a reboot?

3) Is there any way to rig the system to obtain more info if this happens 
again? Right now I'm using an embedded remote console server, but I could 
switch the system to a serial port if enabling the kernel debugger might help. 
But I think that the sleeping thread bit would happen even at the debugger 
prompt, wouldn't it? 

I just booted the new kernel and tried this again, and got another crash. The 
message is identical to the first, except that the instruction pointer changed 
by 0x10 (presumably due to code differences between the old and new kernels) 
and it got 6MB further writing the crash dump.

Since it seems I can reproduce this at will, I'll be glad to either perform 
additional information-gathering or give a developer access to the box for 
testing purposes.

Is it possible to correlate the source line in the kernel with the instruction 
pointer in the panic? 

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA

John Baldwin

2010-May-14 11:53 UTC

head link

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Terry Kennedy wrote:>   I'm reposting this over here at the suggestion of the Forums
moderator.
> The original post is at http://forums.freebsd.org/showthread.php?t=14163
> 
> Got an interesting crash just now (well, as interesting as a crash on a 
> soon-to-be production system can be).
> 
> This is 8-STABLE/amd64, last cvsup'd early in the morning of May 9th.
> 
> The system didn't complete the crash dump, so it needed a manual reset
to get
> it going again.
> 
> The crash was a "page fault while in kernel mode" with the
current process
> being the interrupt service routine for the bce0 GigE. Things progressed 
> reasonably until partway through the dump, when the system locked up with a
> "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock".
That's the
> same PID as reported in the main crash.
Hmm.  You could try changing the code to not do a nested panic in that 
case.  You would update subr_turnstile.c to just return if panicstr is 
not NULL rather than calling panic.  However, there is still a good 
chance you will end up deadlocking in that case.  I have another patch I 
can send you next week that prevents blocking on mutexes duing a panic 
which may also help.
> 3) Is there any way to rig the system to obtain more info if this happens 
> again? Right now I'm using an embedded remote console server, but I
could
> switch the system to a serial port if enabling the kernel debugger might
help.
> But I think that the sleeping thread bit would happen even at the debugger 
> prompt, wouldn't it? 
Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.
> I just booted the new kernel and tried this again, and got another crash.
The
> message is identical to the first, except that the instruction pointer
changed
> by 0x10 (presumably due to code differences between the old and new
kernels)
> and it got 6MB further writing the crash dump.
> 
> Since it seems I can reproduce this at will, I'll be glad to either
perform
> additional information-gathering or give a developer access to the box for 
> testing purposes.
> 
> Is it possible to correlate the source line in the kernel with the
instruction
> pointer in the panic? 
If you are booted into the same kernel with the same modules loaded, you 
can probably run 'kgdb' as root do 'l *<instruction
pointer>'.

-- 
John Baldwin

Terry Kennedy

2010-May-14 12:12 UTC

head link

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

> > The crash was a "page fault while in kernel mode" with the
current process
> > being the interrupt service routine for the bce0 GigE. Things
progressed
> > reasonably until partway through the dump, when the system locked up
with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable
lock". That's the
> > same PID as reported in the main crash.
>
> Hmm.  You could try changing the code to not do a nested panic in that
> case.  You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic.  However, there is still a good
> chance you will end up deadlocking in that case.  I have another patch I
> can send you next week that prevents blocking on mutexes duing a panic
> which may also help.
  Ok, I'll be glad to try that.
> > 3) Is there any way to rig the system to obtain more info if this
happens
> > again? Right now I'm using an embedded remote console server, but
I could
> > switch the system to a serial port if enabling the kernel debugger
might help.
> > But I think that the sleeping thread bit would happen even at the
debugger
> > prompt, wouldn't it?
>
> Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.
  Hmmm. Do you think it will get very far before the sleeping thread business
locks it up?
> > Is it possible to correlate the source line in the kernel with the
instruction
> > pointer in the panic?
>
> If you are booted into the same kernel with the same modules loaded, you
> can probably run 'kgdb' as root do 'l *<instruction
pointer>'.
  I did that and discovered that the 0x20: prefix is probably unwanted:

(kgdb) l *0x20:0xffffffff801e3c06
A syntax error in expression, near `:0xffffffff801e3c06'.
(kgdb) l *0xffffffff801e3c06
0xffffffff801e3c06 is in bce_start_locked (/usr/src/sys/dev/bce/if_bce.c:6996).
6991                    }
6992
6993                    count++;
6994
6995                    /* Send a copy of the frame to any BPF listeners. */
6996                    ETHER_BPF_MTAP(ifp, m_head);
6997            }
6998
6999            /* Exit if no packets were dequeued. */
7000            if (count == 0) {
(kgdb) 

  This kernel does have BPF compiled in, but I don't think it was in use at
the time. 

  Any further suggestions to look at (remember, this system is in another
state from me and all I have is remote access to the framebuffer - I'd have
to go there and set up a serial console to be able to talk to the debugger
if it crashes).

	Thanks,
        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA

Matthew Fleming

2010-May-14 13:29 UTC

head link

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

> > The crash was a "page fault while in kernel mode" with the
current process
> > being the interrupt service routine for the bce0 GigE. Things
progressed
> > reasonably until partway through the dump, when the system locked up
with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable
lock". Thats the
> > same PID as reported in the main crash.
> 
> Hmm.  You could try changing the code to not do a nested panic in that 
> case.  You would update subr_turnstile.c to just return if panicstr is 
> not NULL rather than calling panic.  However, there is still a good 
> chance you will end up deadlocking in that case.  I have another patch I 
> can send you next week that prevents blocking on mutexes duing a panic 
> which may also help.
It would be instructive to know exactly why we were in turnstile(9) but its
likely due to mtx contention.

AIX has some code at the beginning of all the locking operations to avoid taking
locks if we were running code out of kdb, though getting that worked out was
slightly tricky with our variant of mtx_assert(9).  I seem to recall there was
also some "lockbusting" code that forcibly reset all owned locks to
have no owner, at least in some paths.

Given that the system is single-cpu and should be single-threaded when dumping,
this seems to me to be something worth working through to get more reliable
dumps.  Except for mtx_assert(9) I cant think of a reason to take locks once we
start dumping or when in the debugger.

As an aside, with terribly corrupted locks Ive seen double panics when the
attempt to print the lock name faulted in strlen(9) called for printf(9), due to
a bad lockname pointer.  We have been able to get enough info off these crashes
to debug them, but its useful to remember that the system may be in a very
unstable state depending on why it panics.

Thanks,
matthew

freebsd stable - May 2010 - Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write