thr3ads.net - freebsd stable - System deadlock when using mksnap_ffs


Tim Bishop

2008-Nov-12 09:58 UTC

System deadlock when using mksnap_ffs

I've been playing around with snapshots lately but I've got a problem on
one of my servers running 7-STABLE amd64:

FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008     tdb@paladin:/usr/obj/usr/src/sys/PALADIN  amd64

I run the mksnap_ffs command to take the snapshot and some time later
the system completely freezes up:

paladin# cd /u2/.snap/
paladin# mksnap_ffs /u2 test.1

It only happens on this one filesystem, though, which might be to do
with its size. It's not over the 2TB marker, but it's pretty close. It's
also backed by a hardware RAID system, although a smaller filesystem on
the same RAID has no issues.

Filesystem  1K-blocks       Used     Avail Capacity  Mounted on
/dev/da0s1a 2078881084 921821396 990749202    48%    /u2
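To put a number on "pretty close": one plausible reading of the "2TB marker" is 2 TiB, i.e. 2^31 one-kilobyte blocks, which is also the size limit of an MBR slice (2^32 sectors of 512 bytes) such as da0s1. That reading is an assumption. A minimal C sketch of the arithmetic; `fraction_of_2tib` is a hypothetical helper, not part of any FreeBSD tool.

```c
#include <stdint.h>

/* 2 TiB expressed in df's 1K blocks: 2^41 bytes / 2^10 bytes = 2^31 blocks.
 * Assumption: "the 2TB marker" means 2 TiB. */
#define TWO_TIB_IN_KBLOCKS ((uint64_t)1 << 31)

/* Hypothetical helper: fraction of 2 TiB occupied by a filesystem whose
 * size is taken from the "1K-blocks" column of df. */
double fraction_of_2tib(uint64_t kblocks)
{
    return (double)kblocks / (double)TWO_TIB_IN_KBLOCKS;
}
```

For /u2, `fraction_of_2tib(2078881084)` is about 0.968, leaving 68,602,564 1K blocks (roughly 65 GiB) below the 2 TiB line.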

To clarify "completely freezes up": unresponsive to all services over
the network, except ping. On the console I can switch between the ttys,
but none of them respond. The only way out is to hit the reset button.

Any advice? I'm happy to help debug this further to get to the bottom of
it.

Thanks,

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984

David Peall

2008-Nov-12 10:11 UTC


System deadlock when using mksnap_ffs

> -----Original Message-----
> From: owner-freebsd-stable@freebsd.org [mailto:owner-freebsd-
> stable@freebsd.org] On Behalf Of Tim Bishop
> Sent: 12 November 2008 07:58 PM
> To: freebsd-stable@freebsd.org
> Cc: tim@bishnet.net
> Subject: System deadlock when using mksnap_ffs
>
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN  amd64
> 
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:

If the file system is UFS2, it's a known problem, but it should have been
fixed. See:
http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

Check which revision of subr_sleepqueue.c your kernel was built with:

ident /boot/kernel/kernel | grep subr_sleepqueue

Is the version greater than 1.39.2.3? It should be.
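The check above compares dotted CVS revision numbers, which has to be done one numeric component at a time rather than as strings (as strings, "1.39.2.10" would sort before "1.39.2.9"). A minimal C sketch; `cvs_rev_cmp` is a hypothetical name, and the comparison is only meaningful for revisions on the same branch, such as the 1.39.2.x revisions discussed here.

```c
#include <stdlib.h>

/* Hypothetical helper: compare two dotted CVS revisions such as "1.39.2.5".
 * Returns <0, 0 or >0, like strcmp(). Missing trailing components count as 0.
 * Only meaningful for revisions on the same branch. */
int cvs_rev_cmp(const char *a, const char *b)
{
    while (*a != '\0' || *b != '\0') {
        char *end_a, *end_b;
        unsigned long na = strtoul(a, &end_a, 10);
        unsigned long nb = strtoul(b, &end_b, 10);

        if (na != nb)
            return na < nb ? -1 : 1;
        /* Stop on malformed input rather than looping forever. */
        if (end_a == a && end_b == b)
            break;
        /* Skip the '.' separator, if any, before the next component. */
        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
    return 0;
}
```

For example, the 1.39.2.5 revision reported further down this thread satisfies `cvs_rev_cmp("1.39.2.5", "1.39.2.3") > 0`.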

Regards

--
David Peall :: IT Manager
e-Schools' Network :: http://www.esn.org.za/
Phone +27 (021) 674-9140

Kostik Belousov

2008-Nov-12 11:47 UTC


System deadlock when using mksnap_ffs

On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I've been playing around with snapshots lately but I've got a problem on
> one of my servers running 7-STABLE amd64:
> 
> FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10
> 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN  amd64
> 
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
> 
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1
> 
> It only happens on this one filesystem, though, which might be to do
> with its size. It's not over the 2TB marker, but it's pretty close. It's
> also backed by a hardware RAID system, although a smaller filesystem on
> the same RAID has no issues.
> 
> Filesystem  1K-blocks       Used     Avail Capacity  Mounted on
> /dev/da0s1a 2078881084 921821396 990749202    48%    /u2
> 
> To clarify "completely freezes up": unresponsive to all services over
> the network, except ping. On the console I can switch between the ttys,
> but none of them respond. The only way out is to hit the reset button.
> 
> Any advice? I'm happy to help debug this further to get to the bottom of
> it.

You need to provide the information described in
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
and especially
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

Tim Bishop

2008-Nov-12 11:49 UTC


System deadlock when using mksnap_ffs

On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> I run the mksnap_ffs command to take the snapshot and some time later
> the system completely freezes up:
> 
> paladin# cd /u2/.snap/
> paladin# mksnap_ffs /u2 test.1

Someone (not named because they chose not to reply to the list) gave me
the following patch:

--- sys/ufs/ffs/ffs_snapshot.c.orig	Wed Mar 22 09:42:31 2006
+++ sys/ufs/ffs/ffs_snapshot.c	Mon Nov 20 14:59:13 2006
@@ -282,6 +282,8 @@ restart:
 		if (error)
 			goto out;
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 	}
 	/*
 	 * Copy all the cylinder group maps. Although the
@@ -303,6 +305,8 @@ restart:
 			goto out;
 		error = cgaccount(cg, vp, nbp, 1);
 		bawrite(nbp);
+		if (cg % 10 == 0)
+			ffs_syncvnode(vp, MNT_WAIT);
 		if (error)
 			goto out;
 	}

With the description:

"What can happen is on a big file system it will fill up the buffer
cache with I/O and then run out.  When the buffer cache fills up then no
more disk I/O can happen :-(  When you do a sync, it flushes that out to
disk so things don't hang."
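The explanation above can be illustrated with a toy model. This is NOT the FFS code: it assumes each cylinder group copied while building the snapshot leaves one buffer outstanding in a fixed-size cache, and pessimistically assumes nothing drains on its own (in reality bawrite() starts an asynchronous write that can complete). Only an explicit sync, standing in for the patch's `ffs_syncvnode(vp, MNT_WAIT)` every 10 cylinder groups, empties it. `simulate_snapshot` and its parameters are hypothetical.

```c
/*
 * Toy model, not kernel code. Returns the cylinder group at which no
 * buffer is left (where the writer would block, i.e. the "hang"), or -1
 * if every cylinder group is processed. sync_every == 0 means "never sync",
 * as in the unpatched loop.
 */
int simulate_snapshot(int ncg, int cache_slots, int sync_every)
{
    int pending = 0;    /* buffers written but not yet flushed */

    for (int cg = 0; cg < ncg; cg++) {
        if (pending >= cache_slots)
            return cg;              /* cache exhausted: writer would block */
        pending++;                  /* stands in for bawrite(nbp) */
        if (sync_every > 0 && cg % sync_every == 0)
            pending = 0;            /* stands in for ffs_syncvnode(vp, MNT_WAIT) */
    }
    return -1;
}
```

In this model, without syncing, the loop stalls once `pending` reaches the cache size no matter how large the cache is, so only big filesystems (many cylinder groups) hit it. With a sync every 10 groups, about 10 slots suffice. That is consistent with the patch bounding the outstanding I/O rather than changing what the snapshot code waits on, which is why it can look like a workaround.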

It seems to work too. But it seems more like a workaround than a fix?

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984

Kevin Day

2008-Nov-12 23:54 UTC


System deadlock when using mksnap_ffs

(moving my thread from -fs to -stable)


Before touching anything, here's a description of the symptoms I  
see... Rather busy system, with quite a bit of filesystem activity  
occurring while the snapshot is being made. Quad CPU amd64 box with  
16GB of ram, 6x10Krpm RAID array. Should be reasonably fast.

Filesystem   1K-blocks     Used     Avail Capacity   iused    ifree %iused  Mounted on
/dev/da0s1a  739339824 74357926 605834714    11%   1718540 93855474    2%   /

1.7 million inodes, 71G used of a 705G volume.

Here's a timeline of what I see when starting to make a new snapshot.
I've got a few windows running, showing "top", "iostat", etc.


Baseline disk activity before starting anything:

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       24.0   2.0   355.6    32.0    1  10.7  28


0m0s: Snapshot begins, using "mount -u -o snapshot //.snap/weekly.0 /".
Drives immediately jump to 100% busy, as expected.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0      153.8   6.0  3378.6    95.9    2  16.9 100

The mount process is spending 100% of its time in "biord".


2m10s: The mount process starts spending more and more time in  
"snaplk", alternating with "biord".

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       77.9  67.9  1270.7  3754.2    1  10.7 100


12m15s: The first intermittent slowdowns start affecting other  
processes on the system. Occasionally all active processes will get  
stuck in "snaplk" or "ufs" for 5-10 seconds before resuming.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       77.9  31.0  1150.8  1054.9    1  10.4 100


114m47s: Active processes are briefly stuck in "suspfs"

115m22s: Mount is now in "snaprdb". Active processes are now
completely stuck in "snaplk". The system is still responsive to SIGINFO,
top is still running, etc.; anything that needs the filesystem just hangs.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0      238.8   0.0  3820.1     0.0    1   4.1  99

143m19s: Mount now in wdrain.

143m34s: Finished.

Snapshot logging shows "/: suspended 13.308 sec, redo 153 of 4058".
Most processes were hung for 28 minutes.
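A quick check of the timeline arithmetic: the full stall runs from 115m22s to 143m34s. Reading "redo 153 of 4058" as 153 re-copied cylinder groups out of 4058 is an assumption about the log format. A minimal C sketch; both helper names are hypothetical.

```c
/* Convert a timeline offset such as 143m34s into seconds. */
long offset_seconds(long minutes, long seconds)
{
    return minutes * 60 + seconds;
}

/* Percentage of cylinder groups redone, assuming "redo R of N" means
 * R re-copied cylinder groups out of N in total. */
double redo_percent(long redo, long total)
{
    return total > 0 ? 100.0 * (double)redo / (double)total : 0.0;
}
```

`offset_seconds(143, 34) - offset_seconds(115, 22)` is 1692 seconds (28m12s), matching the "28 minutes" above, and `redo_percent(153, 4058)` is about 3.8%. The 13.308 seconds the log reports as "suspended" is under 1% of that window.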


Is this what others are seeing? It sounds like some of the complaints
are about it getting stuck in the "wdrain" state, which is not what I'm
seeing here.

Another mildly annoying note: any process that touches ".snap" while a
snapshot is being generated gets stuck in "ufs" until it finishes. I
can understand wanting to keep operations in there in sync, but it
would be really nice if "find /" didn't hang when it tries to
descend into .snap, for example.

ts5# cd /.snap
ts5# ls -l
^T
load: 0.17  cmd: ls 3696 [ufs] 0.00u 0.00s 0% 1496k
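A program that walks the tree can at least avoid reading a .snap directory's contents. With find(1) itself, the -prune primary does this (for example `find / -name .snap -prune -o -print`). The C sketch below shows the same idea with fts(3), available on FreeBSD and in glibc. `count_skipping_snap` is a hypothetical name. The sketch assumes that what blocks is reading or stat'ing the snapshot files inside .snap, as in the `ls -l` example, and not the lstat() of the .snap directory entry itself, which fts still performs. The thread does not establish this.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <string.h>

/*
 * Hypothetical helper: walk a tree and count directories and regular
 * files, but never descend into a directory named ".snap". The .snap
 * entry is returned once by fts_read() (so its name can be checked) but
 * its contents are never read. Returns the count, or -1 if fts_open() fails.
 */
long count_skipping_snap(const char *root)
{
    char *paths[] = { (char *)root, NULL };
    FTS *fts = fts_open(paths, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
    FTSENT *ent;
    long count = 0;

    if (fts == NULL)
        return -1;
    while ((ent = fts_read(fts)) != NULL) {
        switch (ent->fts_info) {
        case FTS_D:                         /* directory, preorder visit */
            if (strcmp(ent->fts_name, ".snap") == 0) {
                fts_set(fts, ent, FTS_SKIP);    /* prune: do not read .snap */
                break;
            }
            count++;
            break;
        case FTS_F:                         /* regular file */
            count++;
            break;
        default:                            /* FTS_DP, errors, symlinks: not counted */
            break;
        }
    }
    fts_close(fts);
    return count;
}
```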

Patrick Reich

2008-Nov-13 10:49 UTC


System deadlock when using mksnap_ffs

I'll just chime in briefly. I contacted Jeremy off the list
about this issue a few days ago. I have one spare i386 box
sitting here that I can happily test patches against; if I
can be of help, let me know.

> uname -a
FreeBSD localhost.localdomain 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Tue Nov 11 21:40:27 CST 2008     user@localhost.localdomain:/usr/obj/usr/src/sys/GENERIC  i386
> ident /boot/kernel/kernel | grep sleepqueue
     $FreeBSD: src/sys/kern/subr_sleepqueue.c,v 1.39.2.5 2008/09/16 20:01:57 jhb Exp $

It suffers from the symptoms Jeremy described: the box is not deadlocked
during the snapshot, but I might as well walk away from it because I can't
use it. I'd really like to see this get fixed; I rely on dump for
backups.

Regards,
Pat
-- 

"Jesus, can't I count on you people!?"
--Oh Brother, Where Art Thou, George Clooney
