thr3ads.net - freebsd stable - 7.1-RELEASE I/O hang [Feb 2009]

If this information is useful, please help other people find it:
Share via:

Matt Burke

2009-Feb-04 05:13 UTC

7.1-RELEASE I/O hang

I have a machine with a PERC6/e controller. Attached to that are 3 disk
shelves, each configured as individual 14-disk RAID10 arrays (the PERC
annoyingly only lets you use 8 spans per array)

I can run bonnie++ on the arrays individually with no problem.
I can also run it across a gstripe of the arrays with no problem.

However running it over the 3 arrays in parallel causes something I/O
related in the kernel to hang.

To define 'hang' better:

It appears anything which needs disk io, even on a different controller
(albeit the same mfi driver), will hang. A command like 'ps' cached in
ram will work but bash hangs after execution, presumably while trying to
write ~/.bash_history

'sysctl -a' works but trying to run 'sysctl kern.msgbuf' also
hangs

I've done some research and it seems the usual cause of bonnie++
crashing a system is due to overflowing TCQ. camcontrol doesn't see any
disks, so I've tried setting hw.mfi.max_cmds=32 in /boot/loader.conf but
it hadn't made any difference.

The bonnie++ invocation is this:

(newfs devices mfid[2-3], mount)
bonnie++ -s 64g -u root -p3
bonnie++ -d /data/2 -s 64g -u root -y s >b2 2>&1 &
bonnie++ -d /data/3 -s 64g -u root -y s >b3 2>&1 &
bonnie++ -d /data/4 -s 64g -u root -y s >b4 2>&1 &

and it always hangs on "Rewriting...". It's a fresh 7.1-RELEASE
with
nothing else running (devd, sshd, syslogd, etc)


Any ideas?


--

Kostik Belousov

2009-Feb-04 08:00 UTC

head link

7.1-RELEASE I/O hang

On Wed, Feb 04, 2009 at 12:46:53PM +0000, Matt Burke
wrote:> I have a machine with a PERC6/e controller. Attached to that are 3 disk
> shelves, each configured as individual 14-disk RAID10 arrays (the PERC
> annoyingly only lets you use 8 spans per array)
> 
> I can run bonnie++ on the arrays individually with no problem.
> I can also run it across a gstripe of the arrays with no problem.
> 
> However running it over the 3 arrays in parallel causes something I/O
> related in the kernel to hang.
> 
> To define 'hang' better:
> 
> It appears anything which needs disk io, even on a different controller
> (albeit the same mfi driver), will hang. A command like 'ps' cached
in
> ram will work but bash hangs after execution, presumably while trying to
> write ~/.bash_history
> 
> 'sysctl -a' works but trying to run 'sysctl kern.msgbuf'
also hangs
> 
> I've done some research and it seems the usual cause of bonnie++
> crashing a system is due to overflowing TCQ. camcontrol doesn't see any
> disks, so I've tried setting hw.mfi.max_cmds=32 in /boot/loader.conf
but
> it hadn't made any difference.
> 
> The bonnie++ invocation is this:
> 
> (newfs devices mfid[2-3], mount)
> bonnie++ -s 64g -u root -p3
> bonnie++ -d /data/2 -s 64g -u root -y s >b2 2>&1 &
> bonnie++ -d /data/3 -s 64g -u root -y s >b3 2>&1 &
> bonnie++ -d /data/4 -s 64g -u root -y s >b4 2>&1 &
> 
> and it always hangs on "Rewriting...". It's a fresh
7.1-RELEASE with
> nothing else running (devd, sshd, syslogd, etc)
> 
> 
> Any ideas?
Compile ddb into the kernel, and do "ps" from the ddb prompt. If there
are processes hung in the "nbufkv" state, then the patch below might
help.

Index: gnu/fs/xfs/FreeBSD/xfs_buf.c
==================================================================---
gnu/fs/xfs/FreeBSD/xfs_buf.c	(revision 188080)
+++ gnu/fs/xfs/FreeBSD/xfs_buf.c	(working copy)
@@ -81,7 +81,7 @@
 {
 	struct buf *bp;
 
-	bp = geteblk(0);
+	bp = geteblk(0, 0);
 	if (bp != NULL) {
 		bp->b_bufsize = size;
 		bp->b_bcount = size;
@@ -101,7 +101,7 @@
 	if (len >= MAXPHYS)
 		return (NULL);
 
-	bp = geteblk(len);
+	bp = geteblk(len, 0);
 	if (bp != NULL) {
 		KASSERT(BUF_REFCNT(bp) == 1,
 			("xfs_buf_get_empty: bp %p not locked",bp));
Index: ufs/ffs/ffs_vfsops.c
==================================================================---
ufs/ffs/ffs_vfsops.c	(revision 188080)
+++ ufs/ffs/ffs_vfsops.c	(working copy)
@@ -1747,7 +1747,9 @@
 		    ("bufwrite: needs chained iodone (%p)", bp->b_iodone));
 
 		/* get a new block */
-		newbp = geteblk(bp->b_bufsize);
+		newbp = geteblk(bp->b_bufsize, GB_NOWAIT_BD);
+		if (newbp == NULL)
+			goto normal_write;
 
 		/*
 		 * set it to be identical to the old block.  We have to
@@ -1787,6 +1789,7 @@
 	}
 
 	/* Let the normal bufwrite do the rest for us */
+normal_write:
 	return (bufwrite(bp));
 }
 
Index: kern/vfs_bio.c
==================================================================---
kern/vfs_bio.c	(revision 188080)
+++ kern/vfs_bio.c	(working copy)
@@ -105,7 +105,8 @@
 static void vfs_vmio_release(struct buf *bp);
 static int vfs_bio_clcheck(struct vnode *vp, int size,
 		daddr_t lblkno, daddr_t blkno);
-static int flushbufqueues(int, int);
+static int buf_do_flush(struct vnode *vp);
+static int flushbufqueues(struct vnode *, int, int);
 static void buf_daemon(void);
 static void bremfreel(struct buf *bp);
 
@@ -258,6 +259,7 @@
 #define QUEUE_DIRTY_GIANT 3	/* B_DELWRI buffers that need giant */
 #define QUEUE_EMPTYKVA	4	/* empty buffer headers w/KVA assignment */
 #define QUEUE_EMPTY	5	/* empty buffer headers */
+#define QUEUE_SENTINEL	1024	/* not an queue index, but mark for sentinel */
 
 /* Queues for free buffers with various properties */
 static TAILQ_HEAD(bqueues, buf) bufqueues[BUFFER_QUEUES] = { { 0 } };
@@ -1703,21 +1705,23 @@
  */
 
 static struct buf *
-getnewbuf(int slpflag, int slptimeo, int size, int maxsize)
+getnewbuf(struct vnode *vp, int slpflag, int slptimeo, int size, int maxsize,
+    int gbflags)
 {
+	struct thread *td;
 	struct buf *bp;
 	struct buf *nbp;
 	int defrag = 0;
 	int nqindex;
 	static int flushingbufs;
 
+	td = curthread;
 	/*
 	 * We can't afford to block since we might be holding a vnode lock,
 	 * which may prevent system daemons from running.  We deal with
 	 * low-memory situations by proactively returning memory and running
 	 * async I/O rather then sync I/O.
 	 */
-
 	atomic_add_int(&getnewbufcalls, 1);
 	atomic_subtract_int(&getnewbufrestarts, 1);
 restart:
@@ -1949,8 +1953,9 @@
 	 */
 
 	if (bp == NULL) {
-		int flags;
+		int flags, norunbuf;
 		char *waitmsg;
+		int fl;
 
 		if (defrag) {
 			flags = VFS_BIO_NEED_BUFSPACE;
@@ -1968,9 +1973,35 @@
 		mtx_unlock(&bqlock);
 
 		bd_speedup();	/* heeeelp */
+		if (gbflags & GB_NOWAIT_BD)
+			return (NULL);
 
 		mtx_lock(&nblock);
 		while (needsbuffer & flags) {
+			if (vp != NULL && (td->td_pflags & TDP_BUFNEED) == 0) {
+				mtx_unlock(&nblock);
+				/*
+				 * getblk() is called with a vnode
+				 * locked, and some majority of the
+				 * dirty buffers may as well belong to
+				 * the vnode. Flushing the buffers
+				 * there would make a progress that
+				 * cannot be achieved by the
+				 * buf_daemon, that cannot lock the
+				 * vnode.
+				 */
+				norunbuf = ~(TDP_BUFNEED | TDP_NORUNNINGBUF) |
+				    (td->td_pflags & TDP_NORUNNINGBUF);
+				/* play bufdaemon */
+				td->td_pflags |= TDP_BUFNEED | TDP_NORUNNINGBUF;
+				fl = buf_do_flush(vp);
+				td->td_pflags &= norunbuf;
+				mtx_lock(&nblock);
+				if (fl != 0)
+					continue;
+				if ((needsbuffer & flags) == 0)
+					break;
+			}
 			if (msleep(&needsbuffer, &nblock,
 			    (PRIBIO + 4) | slpflag, waitmsg, slptimeo)) {
 				mtx_unlock(&nblock);
@@ -2039,6 +2070,35 @@
 };
 SYSINIT(bufdaemon, SI_SUB_KTHREAD_BUF, SI_ORDER_FIRST, kproc_start,
&buf_kp);
 
+static int
+buf_do_flush(struct vnode *vp)
+{
+	int flushed;
+
+	flushed = flushbufqueues(vp, QUEUE_DIRTY, 0);
+	/* The list empty check here is slightly racy */
+	if (!TAILQ_EMPTY(&bufqueues[QUEUE_DIRTY_GIANT])) {
+		mtx_lock(&Giant);
+		flushed += flushbufqueues(vp, QUEUE_DIRTY_GIANT, 0);
+		mtx_unlock(&Giant);
+	}
+	if (flushed == 0) {
+		/*
+		 * Could not find any buffers without rollback
+		 * dependencies, so just write the first one
+		 * in the hopes of eventually making progress.
+		 */
+		flushbufqueues(vp, QUEUE_DIRTY, 1);
+		if (!TAILQ_EMPTY(
+			    &bufqueues[QUEUE_DIRTY_GIANT])) {
+			mtx_lock(&Giant);
+			flushbufqueues(vp, QUEUE_DIRTY_GIANT, 1);
+			mtx_unlock(&Giant);
+		}
+	}
+	return (flushed);
+}
+
 static void
 buf_daemon()
 {
@@ -2052,7 +2112,7 @@
 	/*
 	 * This process is allowed to take the buffer cache to the limit
 	 */
-	curthread->td_pflags |= TDP_NORUNNINGBUF;
+	curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
 	mtx_lock(&bdlock);
 	for (;;) {
 		bd_request = 0;
@@ -2067,30 +2127,8 @@
 		 * normally would so they can run in parallel with our drain.
 		 */
 		while (numdirtybuffers > lodirtybuffers) {
-			int flushed;
-
-			flushed = flushbufqueues(QUEUE_DIRTY, 0);
-			/* The list empty check here is slightly racy */
-			if (!TAILQ_EMPTY(&bufqueues[QUEUE_DIRTY_GIANT])) {
-				mtx_lock(&Giant);
-				flushed += flushbufqueues(QUEUE_DIRTY_GIANT, 0);
-				mtx_unlock(&Giant);
-			}
-			if (flushed == 0) {
-				/*
-				 * Could not find any buffers without rollback
-				 * dependencies, so just write the first one
-				 * in the hopes of eventually making progress.
-				 */
-				flushbufqueues(QUEUE_DIRTY, 1);
-				if (!TAILQ_EMPTY(
-				    &bufqueues[QUEUE_DIRTY_GIANT])) {
-					mtx_lock(&Giant);
-					flushbufqueues(QUEUE_DIRTY_GIANT, 1);
-					mtx_unlock(&Giant);
-				}
+			if (buf_do_flush(NULL) == 0)
 				break;
-			}
 			uio_yield();
 		}
 
@@ -2136,7 +2174,7 @@
     0, "Number of buffers flushed with dependecies that require
rollbacks");
 
 static int
-flushbufqueues(int queue, int flushdeps)
+flushbufqueues(struct vnode *lvp, int queue, int flushdeps)
 {
 	struct thread *td = curthread;
 	struct buf sentinel;
@@ -2147,20 +2185,37 @@
 	int flushed;
 	int target;
 
-	target = numdirtybuffers - lodirtybuffers;
-	if (flushdeps && target > 2)
-		target /= 2;
+	if (lvp == NULL) {
+		target = numdirtybuffers - lodirtybuffers;
+		if (flushdeps && target > 2)
+			target /= 2;
+	} else
+		target = 1;
 	flushed = 0;
 	bp = NULL;
+	sentinel.b_qindex = QUEUE_SENTINEL;
 	mtx_lock(&bqlock);
-	TAILQ_INSERT_TAIL(&bufqueues[queue], &sentinel, b_freelist);
+	TAILQ_INSERT_HEAD(&bufqueues[queue], &sentinel, b_freelist);
 	while (flushed != target) {
-		bp = TAILQ_FIRST(&bufqueues[queue]);
-		if (bp == &sentinel)
+		bp = TAILQ_NEXT(&sentinel, b_freelist);
+		if (bp != NULL) {
+			TAILQ_REMOVE(&bufqueues[queue], &sentinel, b_freelist);
+			TAILQ_INSERT_AFTER(&bufqueues[queue], bp, &sentinel,
+			    b_freelist);
+		} else
 			break;
-		TAILQ_REMOVE(&bufqueues[queue], bp, b_freelist);
-		TAILQ_INSERT_TAIL(&bufqueues[queue], bp, b_freelist);
-
+		/*
+		 * Skip sentinels inserted by other invocations of the
+		 * flushbufqueues(), taking care to not reorder them.
+		 */
+		if (bp->b_qindex == QUEUE_SENTINEL)
+			continue;
+		/*
+		 * Only flush the buffers that belong to the
+		 * vnode locked by the curthread.
+		 */
+		if (lvp != NULL && bp->b_vp != lvp)
+			continue;
 		if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0)
 			continue;
 		if (bp->b_pin_count > 0) {
@@ -2208,16 +2263,28 @@
 			BUF_UNLOCK(bp);
 			continue;
 		}
-		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT, td) == 0) {
+		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE, td)
+		    == 0) {
 			mtx_unlock(&bqlock);
 			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
 			    bp, bp->b_vp, bp->b_flags);
-			vfs_bio_awrite(bp);
+			if (curproc == bufdaemonproc)
+				vfs_bio_awrite(bp);
+			else {
+				bremfree(bp);
+				bwrite(bp);
+			}
 			vn_finished_write(mp);
 			VOP_UNLOCK(vp, 0, td);
 			flushwithdeps += hasdeps;
 			flushed++;
-			waitrunningbufspace();
+
+			/*
+			 * Sleeping on runningbufspace while holding
+			 * vnode lock leads to deadlock.
+			 */
+			if (curproc == bufdaemonproc)
+				waitrunningbufspace();
 			numdirtywakeup((lodirtybuffers + hidirtybuffers) / 2);
 			mtx_lock(&bqlock);
 			continue;
@@ -2599,7 +2666,7 @@
 		maxsize = vmio ? size + (offset & PAGE_MASK) : size;
 		maxsize = imax(maxsize, bsize);
 
-		bp = getnewbuf(slpflag, slptimeo, size, maxsize);
+		bp = getnewbuf(vp, slpflag, slptimeo, size, maxsize, flags);
 		if (bp == NULL) {
 			if (slpflag || slptimeo)
 				return NULL;
@@ -2674,14 +2741,17 @@
  * set to B_INVAL.
  */
 struct buf *
-geteblk(int size)
+geteblk(int size, int flags)
 {
 	struct buf *bp;
 	int maxsize;
 
 	maxsize = (size + BKVAMASK) & ~BKVAMASK;
-	while ((bp = getnewbuf(0, 0, size, maxsize)) == 0)
-		continue;
+	while ((bp = getnewbuf(NULL, 0, 0, size, maxsize, flags)) == NULL) {
+		if ((flags & GB_NOWAIT_BD) &&
+		    (curthread->td_pflags & TDP_BUFNEED) != 0)
+			return (NULL);
+	}
 	allocbuf(bp, size);
 	bp->b_flags |= B_INVAL;	/* b_dep cleared by getnewbuf() */
 	KASSERT(BUF_REFCNT(bp) == 1, ("geteblk: bp %p not locked",bp));
Index: sys/proc.h
==================================================================--- sys/proc.h
(revision 188080)
+++ sys/proc.h	(working copy)
@@ -378,6 +378,7 @@
 #define	TDP_NORUNNINGBUF 0x00040000 /* Ignore runningbufspace check */
 #define	TDP_WAKEUP	0x00080000 /* Don't sleep in umtx cond_wait */
 #define	TDP_INBDFLUSH	0x00100000 /* Already in BO_BDFLUSH, do not recurse */
+#define	TDP_BUFNEED	0x00200000 /* Do not recurse into the buf flush */
 
 /*
  * Reasons that the current thread can not be run yet.
Index: sys/buf.h
==================================================================--- sys/buf.h
(revision 188080)
+++ sys/buf.h	(working copy)
@@ -475,6 +475,7 @@
  */
 #define	GB_LOCK_NOWAIT	0x0001		/* Fail if we block on a buf lock. */
 #define	GB_NOCREAT	0x0002		/* Don't create a buf if not found. */
+#define	GB_NOWAIT_BD	0x0004		/* Do not wait for bufdaemon */
 
 #ifdef _KERNEL
 extern int	nbuf;			/* The number of buffer headers */
@@ -519,7 +520,7 @@
 struct buf *incore(struct bufobj *, daddr_t);
 struct buf *gbincore(struct bufobj *, daddr_t);
 struct buf *getblk(struct vnode *, daddr_t, int, int, int, int);
-struct buf *geteblk(int);
+struct buf *geteblk(int, int);
 int	bufwait(struct buf *);
 int	bufwrite(struct buf *);
 void	bufdone(struct buf *);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20090204/8c5aff42/attachment.pgp

Matt Burke

2009-Feb-05 03:27 UTC

head link

7.1-RELEASE I/O hang

Kostik Belousov wrote:> Compile ddb into the kernel, and do "ps" from the ddb prompt. If
there
> are processes hung in the "nbufkv" state, then the patch below
might
> help.
The bonnie++ processes are in state "newbuf" and other hung processes
(bash, newly forked sshds, etc) appear to be in the "ufs" state.

The patch appears to have no effect, although at the last hang I did see
one of the bonnie++ processes in "nbufkv" state. This could be
coincidental.

The problem also exhibits itself when running a parallel bonnie++ on a
single array, both with the onboard PERC6/i and the PERC6/e. I have no
access to other controllers.

Kostik Belousov

2009-Feb-05 04:07 UTC

head link

7.1-RELEASE I/O hang

On Thu, Feb 05, 2009 at 11:26:58AM +0000, Matt Burke
wrote:> Kostik Belousov wrote:
> > Compile ddb into the kernel, and do "ps" from the ddb
prompt. If there
> > are processes hung in the "nbufkv" state, then the patch
below might
> > help.
> 
> The bonnie++ processes are in state "newbuf" and other hung
processes
> (bash, newly forked sshds, etc) appear to be in the "ufs" state.What is the state of the bufdaemon process ?
> 
> The patch appears to have no effect, although at the last hang I did see
> one of the bonnie++ processes in "nbufkv" state. This could be
coincidental.
> 
> 
> The problem also exhibits itself when running a parallel bonnie++ on a
> single array, both with the onboard PERC6/i and the PERC6/e. I have no
> access to other controllers.-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20090205/064f9e49/attachment.pgp

Matt Burke

2009-Feb-05 04:46 UTC

head link

7.1-RELEASE I/O hang

Kostik Belousov wrote:>>> Compile ddb into the kernel, and do "ps" from the ddb
prompt. If there
>>> are processes hung in the "nbufkv" state, then the patch
below might
>>> help.
>> The bonnie++ processes are in state "newbuf" and other hung
processes
>> (bash, newly forked sshds, etc) appear to be in the "ufs"
state.
> What is the state of the bufdaemon process ?
qsleep


--

Kostik Belousov

2009-Feb-06 02:30 UTC

head link

7.1-RELEASE I/O hang

On Thu, Feb 05, 2009 at 12:46:23PM +0000, Matt Burke
wrote:> Kostik Belousov wrote:
> >>> Compile ddb into the kernel, and do "ps" from the
ddb prompt. If there
> >>> are processes hung in the "nbufkv" state, then the
patch below might
> >>> help.
> >> The bonnie++ processes are in state "newbuf" and other
hung processes
> >> (bash, newly forked sshds, etc) appear to be in the
"ufs" state.
> > What is the state of the bufdaemon process ?
> 
> qsleep
Please, increase the value that is assigned to the target variable in the
line 2193 of the patched sys/kern/vfs_bio.c from 1 to, say, 10 or 100.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20090206/342c8b22/attachment.pgp

freebsd stable - Feb 2009 - 7.1-RELEASE I/O hang

7.1-RELEASE I/O hang

7.1-RELEASE I/O hang

7.1-RELEASE I/O hang

7.1-RELEASE I/O hang

7.1-RELEASE I/O hang

7.1-RELEASE I/O hang