I upgraded a box (Dell PowerEdge 1550, dual PIII processors) from a kernel +
world of December 8th to one from today (December 29th), and I am experiencing
a new problem with rdump.

The symptom is that rdump stops sending data to the remote system. It is
still responsive to ^T and can be aborted with ^C. Here's the ^T status on
the sending box (the aforementioned Dell RELENG_7 system):

  DUMP: dumping (Pass IV) [regular files]
  DUMP: 20.49% done, finished in 0:19 at Mon Dec 29 19:58:57 2008
  DUMP: 38.00% done, finished in 0:16 at Mon Dec 29 20:00:52 2008
  DUMP: 55.45% done, finished in 0:12 at Mon Dec 29 20:01:37 2008
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1494 [pause] 2.30u 11.22s 0% 34616k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1492 [sbwait] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k

A tcpdump on both the sending and receiving systems shows no packets
between them from the rdump processes. However, I can rsh both ways and
get the expected output, so the link isn't down. ps shows the same thing
as ^T. The sbwait process looks like this:

  0 1492 1489 0 4 0 36024 34808 sbwait I+ p0 0:07.35 rdump: /dev/amrd0s1f: pass 4: 69.66% done, finished in 0:08 at Mon Dec 29 20:01:53 2008 (rdump)

and the status never changes.

The remote (receiving) system is an HP DS10 running OpenVMS 8.3 with
MultiNet 5.1A as the TCP stack. Despite this being a rather rare
environment, I haven't had any problems until this most recent kernel
build. I have a large number (over a dozen) of other systems, running a
variety of releases (6.4, 7.0, 7.1-PRERELEASE), which can do this same
dump operation without difficulty.

I have the offending dump process still in this stuck state, so I can
generate whatever sort of debugging information is needed. The box is a
test box, so I can crash it and get a core dump if that's what is needed.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
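P.S. For anyone not familiar with the wait channel: "sbwait" just means
the process is blocked waiting for space in a socket buffer - here, the
send buffer of the TCP connection to the DS10. The sketch below (my own
illustration, not dump code; the loopback address and port 9999 are
arbitrary) parks itself in the same state by writing to a connection
whose peer never reads, so you can see the identical wait channel in
"ps -axl". It only reproduces the wait state, not the network-level
stall itself:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <err.h>
  #include <string.h>
  #include <unistd.h>

  int
  main(void)
  {
          struct sockaddr_in sin;
          char buf[8192];
          int lfd, cfd, sfd;

          memset(&sin, 0, sizeof(sin));
          sin.sin_family = AF_INET;
          sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
          sin.sin_port = htons(9999);

          if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                  err(1, "bind");
          if (listen(lfd, 1) < 0)
                  err(1, "listen");

          /* Connect to our own listener; the handshake completes in
           * the kernel, so a blocking connect/accept pair is safe. */
          if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                  err(1, "connect");
          if ((sfd = accept(lfd, NULL, NULL)) < 0)
                  err(1, "accept");

          /* sfd is never read.  Once its receive buffer and cfd's
           * send buffer fill, write() blocks and ps shows sbwait. */
          memset(buf, 'x', sizeof(buf));
          for (;;)
                  if (write(cfd, buf, sizeof(buf)) < 0)
                          err(1, "write");
  }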
On 2008-Dec-29 20:28:41 -0500, Terry Kennedy <terry@tmk.com> wrote:
> I upgraded a box (Dell PowerEdge 1550, dual PIII processors) from a kernel +
> world of December 8th to one from today (December 29th), and I am experiencing
> a new problem with rdump.
[...]
> A tcpdump on both the sending and receiving systems shows no packets
> between them from the rdump processes. However, I can rsh both ways and
> get the expected output, so the link isn't down.

This is probably the critical piece of information - the TCP connection
has stopped transferring data for some reason, and rdump is blocked
waiting to send.

Unfortunately, you need the last packets that were exchanged in order to
identify which end has the problem (and hopefully provide some pointers
as to why). If possible, can you repeat the dump whilst you run a tcpdump
on the rdump flow, and then post the last dozen or so packets in each
direction?

--
Peter Jeremy
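P.S. Something along these lines should be enough to catch the tail of
the conversation (the hostname below is a placeholder for your DS10):

  tcpdump -n -s 0 -w rdump-hang.pcap host ds10.example.com

Let it run until the dump wedges, kill it, then look at the last packets
with "tcpdump -n -r rdump-hang.pcap". If the capture file gets
uncomfortably large before the hang, tcpdump's -C/-W options can rotate
it into a small ring of files.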
> Unfortunately, you need the last packets that were exchanged in order to
> identify which end has the problem (and hopefully provide some pointers
> as to why). If possible, can you repeat the dump whilst you run a tcpdump
> on the rdump flow, and then post the last dozen or so packets in each
> direction?

That could be pretty unpleasant - this happens at a random point while
dumping 4GB or so. If I have to, I'll do it, but I was hoping there was a
better way.

Shouldn't this get torn down by a keepalive at some point? It has been
sitting for 9 hours or so at this point...

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
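P.S. As I understand it (corrections welcome), a keepalive will only tear
this down if (a) the application set SO_KEEPALIVE on the socket, and (b)
the connection has been completely idle for net.inet.tcp.keepidle (two
hours by default). And if there is unacknowledged data sitting in the
send buffer, it's the retransmit timer rather than keepalive that should
eventually drop the connection - which makes the total silence in tcpdump
all the stranger. For reference, opting in to keepalive looks like the
sketch below; whether rdump's rcmd(3) connection actually does this is an
assumption I haven't verified:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <err.h>
  #include <unistd.h>

  int
  main(void)
  {
          int fd, on = 1;

          if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          /* Probes begin only after the connection has been idle for
           * net.inet.tcp.keepidle (two hours by default), and only on
           * sockets that set this option. */
          if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on,
              sizeof(on)) < 0)
                  err(1, "setsockopt(SO_KEEPALIVE)");
          close(fd);
          return (0);
  }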
I'm pretty sure it's caused by FreeBSD. It could very well be related to
PR 117603, a real nasty dump(8) bug that was introduced in 7.0 on SMP
systems. But it should have been patched back in March by this:

  jeff        2008-03-13 00:46:12 UTC

    FreeBSD src repository

    Modified files:
      sys/kern             subr_sleepqueue.c
    Log:
    PR 117603 - Close a sleepqueue signal race by interlocking with the
    per-process spinlock.  This was mistakenly omitted from the thread_lock
    patch and has been a race since.

    MFC After:      1 week
    PR:             bin/117603
    Reported by:    Danny Braniss <danny@cs.huji.ac.il>

    Revision  Changes    Path
    1.48      +5 -2      src/sys/kern/subr_sleepqueue.c

So I'm really surprised it's showing up again. We've got a pretty large
backup environment with dump(8) as a critical element of it. I just hope
the problem will be resolved before 7.1-RELEASE hits the streets.

Terry, please file a bug report on this and get in touch with iedowse@,
who implemented the aforementioned patch.

Andy Kosela
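For anyone who hasn't read the PR, the commit message describes what
looks like the classic lost-wakeup pattern: a sleeper tests its condition
without holding the lock the waker takes, so a wakeup delivered in the
gap is lost and the thread sleeps forever. Here's a userland caricature
(my own - it has nothing to do with the actual sleepqueue code, just the
general shape of the race; build with "cc -pthread"):

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  static int signalled;

  /* BROKEN: tests the flag outside the lock.  If the waker fires
   * between the test and pthread_cond_wait(), the wakeup is lost
   * and this thread sleeps forever. */
  void *
  racy_sleeper(void *arg)
  {
          if (!signalled) {
                  pthread_mutex_lock(&lock);
                  pthread_cond_wait(&cv, &lock);
                  pthread_mutex_unlock(&lock);
          }
          return (NULL);
  }

  /* FIXED: the test and the sleep are interlocked, which is the
   * same general idea as the interlocking the patch adds. */
  void *
  safe_sleeper(void *arg)
  {
          pthread_mutex_lock(&lock);
          while (!signalled)
                  pthread_cond_wait(&cv, &lock);
          pthread_mutex_unlock(&lock);
          return (NULL);
  }

  int
  main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, safe_sleeper, NULL);
          sleep(1);
          pthread_mutex_lock(&lock);
          signalled = 1;
          pthread_cond_signal(&cv);
          pthread_mutex_unlock(&lock);
          pthread_join(t, NULL);
          printf("safe sleeper woke up\n");
          return (0);
  }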
> I'm pretty sure it's caused by FreeBSD. It could very well be related to
> PR 117603, a real nasty dump(8) bug that was introduced in 7.0 on SMP
> systems. But it should have been patched back in March by this:
[...]
> So I'm really surprised it's showing up again. We've got a pretty large
> backup environment with dump(8) as a critical element of it. I just hope
> the problem will be resolved before 7.1-RELEASE hits the streets.
>
> Terry, please file a bug report on this and get in touch with iedowse@,
> who implemented the aforementioned patch.

I don't think my hang is related to that problem - mine seems to be in the
TCP code, while that one seems to be in the kernel / filesystem code (or
at least that's what I recall of it from prior discussions).

Plus, my problem just showed up in a recent build. The last time
subr_sleepqueue.c was touched seems to have been back in September.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA