I upgraded a box (Dell PowerEdge 1550, dual PIII processors) from a kernel +
world of December 8th to one from today (December 29th), and I am experiencing
a new problem with rdump.

The symptom is that rdump stops sending data to the remote system. It is
still responsive to ^T and can be aborted with ^C. Here's the ^T status on
the sending box (the aforementioned Dell RELENG_7 system):

  DUMP: dumping (Pass IV) [regular files]
  DUMP: 20.49% done, finished in 0:19 at Mon Dec 29 19:58:57 2008
  DUMP: 38.00% done, finished in 0:16 at Mon Dec 29 20:00:52 2008
  DUMP: 55.45% done, finished in 0:12 at Mon Dec 29 20:01:37 2008
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1494 [pause] 2.30u 11.22s 0% 34616k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.00  cmd: rdump 1492 [sbwait] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
  load: 0.02  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
  load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k

A tcpdump on both the sending and receiving systems shows no packets
between them from the rdump processes. However, I can rsh both ways and
get the expected output, so the link isn't down. ps shows the same thing
as ^T. The sbwait process looks like this:

  0 1492 1489 0 4 0 36024 34808 sbwait I+ p0 0:07.35 rdump: /dev/amrd0s1f: pass 4: 69.66% done, finished in 0:08 at Mon Dec 29 20:01:53 2008 (rdump)

and the status never changes.

The remote (receiving) system is an HP DS10 running OpenVMS 8.3 with
MultiNet 5.1A as the TCP stack. Despite this being a rather rare
environment, I haven't had any problems until this most recent kernel
build. I have a large number (over a dozen) of other systems, running a
variety of releases (6.4, 7.0, 7.1-PRERELEASE), which can do this same
dump operation without difficulty.

I have the offending dump process still in this stuck state, so I can
generate whatever sort of debugging information is needed. The box is a
test box, so I can crash it and get a core dump if that's what is needed.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
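P.S. For anyone not familiar with the wait channel: "sbwait" just means
the process is blocked waiting for space in a socket buffer - here, the
send buffer of the TCP connection to the DS10. The sketch below (my own
illustration, not dump code; the loopback address and port 9999 are
arbitrary) parks itself in the same state by writing to a connection
whose peer never reads, so you can see the identical wait channel in
"ps -axl". It only reproduces the wait state, not the network-level
stall itself:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <err.h>
  #include <string.h>
  #include <unistd.h>

  int
  main(void)
  {
          struct sockaddr_in sin;
          char buf[8192];
          int lfd, cfd, sfd;

          memset(&sin, 0, sizeof(sin));
          sin.sin_family = AF_INET;
          sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
          sin.sin_port = htons(9999);

          if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                  err(1, "bind");
          if (listen(lfd, 1) < 0)
                  err(1, "listen");

          /* Connect to our own listener; the handshake completes in
           * the kernel, so a blocking connect/accept pair is safe. */
          if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
                  err(1, "connect");
          if ((sfd = accept(lfd, NULL, NULL)) < 0)
                  err(1, "accept");

          /* sfd is never read.  Once its receive buffer and cfd's
           * send buffer fill, write() blocks and ps shows sbwait. */
          memset(buf, 'x', sizeof(buf));
          for (;;)
                  if (write(cfd, buf, sizeof(buf)) < 0)
                          err(1, "write");
  }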
On 2008-Dec-29 20:28:41 -0500, Terry Kennedy <terry@tmk.com> wrote:
> I upgraded a box (Dell PowerEdge 1550, dual PIII processors) from a kernel +
> world of December 8th to one from today (December 29th), and I am experiencing
> a new problem with rdump.
[...]
> A tcpdump on both the sending and receiving systems shows no packets
> between them from the rdump processes. However, I can rsh both ways and
> get the expected output, so the link isn't down.

This is probably the critical piece of information - the TCP connection
has stopped transferring data for some reason, and rdump is blocked
waiting to send.

Unfortunately, you need the last packets that were exchanged in order to
identify which end has the problem (and hopefully provide some pointers
as to why). If possible, can you repeat the dump whilst you run a tcpdump
on the rdump flow, and then post the last dozen or so packets in each
direction?

--
Peter Jeremy
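P.S. Something along these lines should be enough to catch the tail of
the conversation (the hostname below is a placeholder for your DS10):

  tcpdump -n -s 0 -w rdump-hang.pcap host ds10.example.com

Let it run until the dump wedges, kill it, then look at the last packets
with "tcpdump -n -r rdump-hang.pcap". If the capture file gets
uncomfortably large before the hang, tcpdump's -C/-W options can rotate
it into a small ring of files.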
> Unfortunately, you need the last packets that were exchanged in order to
> identify which end has the problem (and hopefully provide some pointers
> as to why). If possible, can you repeat the dump whilst you run a tcpdump
> on the rdump flow, and then post the last dozen or so packets in each
> direction?

That could be pretty unpleasant - this happens at a random point while
dumping 4GB or so. If I have to, I'll do it, but I was hoping there was a
better way.

Shouldn't this get torn down by a keepalive at some point? It has been
sitting for 9 hours or so at this point...

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
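P.S. As I understand it (corrections welcome), a keepalive will only tear
this down if (a) the application set SO_KEEPALIVE on the socket, and (b)
the connection has been completely idle for net.inet.tcp.keepidle (two
hours by default). And if there is unacknowledged data sitting in the
send buffer, it's the retransmit timer rather than keepalive that should
eventually drop the connection - which makes the total silence in tcpdump
all the stranger. For reference, opting in to keepalive looks like the
sketch below; whether rdump's rcmd(3) connection actually does this is an
assumption I haven't verified:

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <err.h>
  #include <unistd.h>

  int
  main(void)
  {
          int fd, on = 1;

          if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
                  err(1, "socket");
          /* Probes begin only after the connection has been idle for
           * net.inet.tcp.keepidle (two hours by default), and only on
           * sockets that set this option. */
          if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on,
              sizeof(on)) < 0)
                  err(1, "setsockopt(SO_KEEPALIVE)");
          close(fd);
          return (0);
  }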
I'm pretty sure it's caused by FreeBSD. It could very well be related to
PR 117603, a real nasty dump(8) bug that was introduced in 7.0 on SMP
systems. But it should have been patched back in March by this:

  jeff        2008-03-13 00:46:12 UTC

    FreeBSD src repository

    Modified files:
      sys/kern             subr_sleepqueue.c
    Log:
    PR 117603 - Close a sleepqueue signal race by interlocking with the
    per-process spinlock.  This was mistakenly omitted from the thread_lock
    patch and has been a race since.

    MFC After:      1 week
    PR:             bin/117603
    Reported by:    Danny Braniss <danny@cs.huji.ac.il>

    Revision  Changes    Path
    1.48      +5 -2      src/sys/kern/subr_sleepqueue.c

So I'm really surprised it's showing up again. We've got a pretty large
backup environment with dump(8) as a critical element of it. I just hope
the problem will be resolved before 7.1-RELEASE hits the streets.

Terry, please file a bug report on this and get in touch with iedowse@,
who implemented the aforementioned patch.

Andy Kosela
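For anyone who hasn't read the PR, the commit message describes what
looks like the classic lost-wakeup pattern: a sleeper tests its condition
without holding the lock the waker takes, so a wakeup delivered in the
gap is lost and the thread sleeps forever. Here's a userland caricature
(my own - it has nothing to do with the actual sleepqueue code, just the
general shape of the race; build with "cc -pthread"):

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  static int signalled;

  /* BROKEN: tests the flag outside the lock.  If the waker fires
   * between the test and pthread_cond_wait(), the wakeup is lost
   * and this thread sleeps forever. */
  void *
  racy_sleeper(void *arg)
  {
          if (!signalled) {
                  pthread_mutex_lock(&lock);
                  pthread_cond_wait(&cv, &lock);
                  pthread_mutex_unlock(&lock);
          }
          return (NULL);
  }

  /* FIXED: the test and the sleep are interlocked, which is the
   * same general idea as the interlocking the patch adds. */
  void *
  safe_sleeper(void *arg)
  {
          pthread_mutex_lock(&lock);
          while (!signalled)
                  pthread_cond_wait(&cv, &lock);
          pthread_mutex_unlock(&lock);
          return (NULL);
  }

  int
  main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, safe_sleeper, NULL);
          sleep(1);
          pthread_mutex_lock(&lock);
          signalled = 1;
          pthread_cond_signal(&cv);
          pthread_mutex_unlock(&lock);
          pthread_join(t, NULL);
          printf("safe sleeper woke up\n");
          return (0);
  }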
> I'm pretty sure it's caused by FreeBSD. It could very well be related to
> PR 117603, a real nasty dump(8) bug that was introduced in 7.0 on SMP
> systems. But it should have been patched back in March by this:
[...]
> So I'm really surprised it's showing up again. We've got a pretty large
> backup environment with dump(8) as a critical element of it. I just hope
> the problem will be resolved before 7.1-RELEASE hits the streets.
>
> Terry, please file a bug report on this and get in touch with iedowse@,
> who implemented the aforementioned patch.

I don't think my hang is related to that problem - mine seems to be in the
TCP code, while that one seems to be in the kernel / filesystem code (or
at least that's what I recall of it from prior discussions).

Plus, my problem just showed up in a recent build. The last time
subr_sleepqueue.c was touched seems to have been back in September.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA