I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.

Any advice? I'm happy to help debug this further to get to the bottom of
it.

Thanks,
Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
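For a rough sense of how close /u2 is to the 2TB mark, the 1K-blocks figure from the df output above can be converted directly. This is only a sketch: pct_of_2tib is a hypothetical helper name, and it assumes "2TB" means 2 TiB (2^41 bytes, i.e. 2147483648 blocks of 1K).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print a filesystem size,
# given in 1K blocks as reported by df, as a percentage of 2 TiB.
pct_of_2tib() {
    # Assumption: 2 TiB in 1K blocks is 2 * 1024^3 = 2147483648.
    awk -v blocks="$1" 'BEGIN { printf "%.1f\n", blocks * 100 / 2147483648 }'
}

# 1K-blocks value for /u2, copied from the df output in the message.
pct_of_2tib 2078881084
```

Under that assumption the filesystem sits at just under 97% of 2 TiB, which matches the "not over, but pretty close" description.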
> -----Original Message-----
> From: owner-freebsd-stable@freebsd.org
> [mailto:owner-freebsd-stable@freebsd.org] On Behalf Of Tim Bishop
> Sent: 12 November 2008 07:58 PM
> To: freebsd-stable@freebsd.org
> Cc: tim@bishnet.net
> Subject: System deadlock when using mksnap_ffs
>
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN amd64
>
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:

If the file system is UFS2 it's a known problem but should have been
fixed.

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

ident /boot/kernel/kernel | grep subr_sleepqueue

version should be greater than 1.39.2.3?

Regards
--
David Peall :: IT Manager
e-Schools' Network :: http://www.esn.org.za/
Phone +27 (021) 674-9140
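The revision check suggested above can be scripted. The following is a minimal sketch, not an official tool: rev_from_ident and rev_newer are hypothetical helper names, and the sample line is the subr_sleepqueue.c ident string reported elsewhere in this thread. The comparison is purely numeric, component by component, so it is only meaningful for revisions on the same CVS branch (here 1.39.2.x).

```shell
#!/bin/sh
# Hypothetical helper: read $FreeBSD$ ident lines on stdin and print the
# revision field that follows ",v".
rev_from_ident() {
    sed -n 's/.*,v \([0-9.]*\) .*/\1/p'
}

# Hypothetical helper: succeed if dotted revision $1 is numerically
# newer than $2. Runs in a subshell so its variables do not leak.
rev_newer() (
    a=$1; b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}; y=${b%%.*}
        # Drop the component just read; clear the string at the last one.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
        if [ "${x:-0}" -gt "${y:-0}" ]; then exit 0; fi
        if [ "${x:-0}" -lt "${y:-0}" ]; then exit 1; fi
    done
    exit 1    # equal revisions are not "newer"
)

line='$FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $'
found=$(printf '%s\n' "$line" | rev_from_ident)
if rev_newer "$found" 1.39.2.3; then
    echo "$found is newer than 1.39.2.3"
else
    echo "$found is not newer than 1.39.2.3"
fi
```

On a live system the sample line would be replaced by the output of the ident command quoted above, piped through grep subr_sleepqueue and then rev_from_ident.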
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I've been playing around with snapshots lately but I've got a problem on
> one of my servers running 7-STABLE amd64:
>
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN amd64
>
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
>
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1
>
> It only happens on this one filesystem, though, which might be to do
> with its size. It's not over the 2TB marker, but it's pretty close. It's
> also backed by a hardware RAID system, although a smaller filesystem on
> the same RAID has no issues.
>
> Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
>
> To clarify "completely freezes up": unresponsive to all services over
> the network, except ping. On the console I can switch between the ttys,
> but none of them respond. The only way out is to hit the reset button.
>
> Any advice? I'm happy to help debug this further to get to the bottom of
> it.

You need to provide the information described in
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
and especially
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
>
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1

Someone (not named because they chose not to reply to the list) gave me
the following patch:

--- sys/ufs/ffs/ffs_snapshot.c.orig Wed Mar 22 09:42:31 2006
+++ sys/ufs/ffs/ffs_snapshot.c Mon Nov 20 14:59:13 2006
@@ -282,6 +282,8 @@ restart:
                 if (error)
                         goto out;
                 bawrite(nbp);
+                if (cg % 10 == 0)
+                        ffs_syncvnode(vp, MNT_WAIT);
         }
         /*
          * Copy all the cylinder group maps. Although the
@@ -303,6 +305,8 @@ restart:
                         goto out;
                 error = cgaccount(cg, vp, nbp, 1);
                 bawrite(nbp);
+                if (cg % 10 == 0)
+                        ffs_syncvnode(vp, MNT_WAIT);
                 if (error)
                         goto out;
         }

With the description:

"What can happen is on a big file system it will fill up the buffer
cache with I/O and then run out. When the buffer cache fills up then no
more disk I/O can happen :-( When you do a sync, it flushes that out to
disk so things don't hang."

It seems to work too. But it seems more like a workaround than a fix?

Tim.

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
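The effect the patch description claims can be illustrated with a toy model. This is not kernel code and says nothing about real buffer cache sizes: peak_queued is a hypothetical function that counts queued asynchronous writes while walking the cylinder groups, under the pessimistic assumption that nothing reaches disk until a synchronous flush happens. The count of 1000 cylinder groups is an arbitrary illustration value, not a figure from the thread.

```shell
#!/bin/sh
# Toy model (an assumption-laden sketch, not kernel behaviour): report the
# largest number of writes queued at once while looping over `ncg`
# cylinder groups, flushing whenever cg % interval == 0.
# An interval of 0 means "never flush", i.e. the unpatched loop.
peak_queued() (
    ncg=$1; interval=$2
    queued=0; peak=0; cg=0
    while [ "$cg" -lt "$ncg" ]; do
        queued=$((queued + 1))              # stands in for bawrite(nbp)
        if [ "$queued" -gt "$peak" ]; then peak=$queued; fi
        if [ "$interval" -gt 0 ] && [ $((cg % interval)) -eq 0 ]; then
            queued=0                        # stands in for ffs_syncvnode(vp, MNT_WAIT)
        fi
        cg=$((cg + 1))
    done
    echo "$peak"
)

echo "no periodic sync: peak $(peak_queued 1000 0)"
echo "sync every 10:    peak $(peak_queued 1000 10)"
```

In this model the unpatched loop lets the queue grow with the number of cylinder groups, while the patched loop caps it at the flush interval. That is consistent with the "workaround" reading: the periodic flush bounds the backlog without changing why the backlog stalls other I/O.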
(moving my thread from -fs to -stable)

Before touching anything, here's a description of the symptoms I see...

Rather busy system, with quite a bit of filesystem activity occurring
while the snapshot is being made. Quad CPU amd64 box with 16GB of ram,
6x10Krpm RAID array. Should be reasonably fast.

Filesystem  1K-blocks     Used     Avail Capacity   iused    ifree %iused  Mounted on
/dev/da0s1a 739339824 74357926 605834714    11%   1718540 93855474    2%   /

1.7 million inodes, 71G used of a 705G volume.

Here's a timeline of what I see when starting to make a new snapshot.
I've got a few windows running, showing "top", "iostat", etc.

Baseline disk activity before starting anything:
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       24.0   2.0   355.6    32.0    1  10.7  28

0m0s: Snapshot begins, using "mount -u -o snapshot //.snap/weekly.0 /"
Drives immediately jump to 100% busy as expected.
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0      153.8   6.0  3378.6    95.9    2  16.9 100
The mount process is spending 100% of its time in "biord".

2m10s: The mount process starts spending more and more time in "snaplk",
alternating with "biord".
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       77.9  67.9  1270.7  3754.2    1  10.7 100

12m15s: The first intermittent slowdowns start affecting other processes
on the system. Occasionally all active processes will get stuck in
"snaplk" or "ufs" for 5-10 seconds before resuming.
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0       77.9  31.0  1150.8  1054.9    1  10.4 100

114m47s: Active processes are briefly stuck in "suspfs".

115m22s: Mount is now in "snaprdb"; active processes are now completely
stuck in "snaplk". Still responsive to SIGINFO, top is still running,
etc. Just hangs any time anything needs the filesystem.
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0      238.8   0.0  3820.1     0.0    1   4.1  99

143m19s: Mount now in "wdrain".

143m34s: Finished. Snapshot logging shows
"/: suspended 13.308 sec, redo 153 of 4058"

Most processes were hung for 28 minutes.

Is this what others are seeing?
It sounds like some of the complaints are about it getting stuck in the
"wdrain" state, which is not what I'm showing here.

Another mildly annoying note: any process that touches ".snap" while a
snapshot is being generated gets stuck in "ufs" until it finishes. I can
understand wanting to keep operations in there in sync, but it would be
really nice if "find /" wouldn't get hung when it tries to descend into
.snap, for example.

ts5# cd /.snap
ts5# ls -l
^T
load: 0.17  cmd: ls 3696 [ufs] 0.00u 0.00s 0% 1496k
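The "28 minutes" figure follows from the timeline: processes become completely stuck at 115m22s and the snapshot finishes at 143m34s. A small sketch of that arithmetic, where to_secs is a hypothetical helper and the only inputs are the two timestamps from the message:

```shell
#!/bin/sh
# Hypothetical helper: convert a "115m22s"-style timestamp to seconds.
# Runs in a subshell so its variables do not leak.
to_secs() (
    m=${1%%m*}      # minutes: everything before the "m"
    s=${1#*m}       # remainder after the "m", e.g. "22s"
    s=${s%s}        # strip the trailing "s"
    echo $((m * 60 + s))
)

start=$(to_secs 115m22s)    # processes completely stuck in "snaplk"
end=$(to_secs 143m34s)      # snapshot finished
hung=$((end - start))
echo "hung for $((hung / 60))m$((hung % 60))s of $((end / 60))m$((end % 60))s total"
```

The kernel's own "suspended 13.308 sec" log line covers only the formal suspension, so it is much shorter than the stall that users of the filesystem actually experienced.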
I'll just chime in briefly. I contacted Jeremy off the list about this
issue a few days ago. I have one spare i386 box sitting here that I can
happily test patches against; if I can be of help, let me know.

> uname -a
FreeBSD localhost.localdomain 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Tue Nov 11 21:40:27 CST 2008 user@localhost.localdomain:/usr/obj/usr/src/sys/GENERIC i386

> ident /boot/kernel/kernel | grep sleepqueue
$FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $

It suffers from the behaviour Jeremy describes: the box is not
deadlocked during the snapshot, but I might as well walk away from it
because I can't use it.

I'd really like to see this get fixed; I rely on dump for backups.

Regards,
Pat

--
"Jesus, can't I count on you people!?"
--Oh Brother, Where Art Thou, George Clooney