Scott Lambert
2009-Mar-20 13:09 UTC
Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
I have a previously stable machine, other than a one time panic in
soft-updates which I could never reproduce, running RELENG_7 from July
23, 2008.

Starting update: Wed Jul 23 01:29:47 CDT 2008
Finished update: Wed Jul 23 01:31:13 CDT 2008

I had the userquota option in the fstab for /home, but I did not yet
have anything in /etc/rc.conf to enable them. I have been running an
unmodified GENERIC kernel config.

/dev/mirror/gm0s1g on /home (ufs, local, soft-updates)

It runs a few jails, using ezjails. Two of them were image based jails,
1GB and 2GB. There is also one non-image file jail. The jails live in
/home/ezjails.

I added another image based jail, 3GB image, on March 12th.

I added this machine to our AMANDA setup on March 13, 2009.

Things seemed to be okay until the 19th. On the 19th, during the dump
of /home, things gradually started to hang. Nagios paged me about
services not responding.

I did not find any explanation for it. The disks were idle according to
systat -vm. I was able to grep the log files on /var for a while, and
then I could no longer do anything with it.

I eventually had to go to the office and power cycle it. I tried C-A-D
first, but shutdown timed out after 30 seconds.

Just to make sure it wasn't something that had since been fixed, I
updated to RELENG_7 as of Mar 19th.

Starting update: Thu Mar 19 03:40:41 CDT 2009
Finished update: Thu Mar 19 03:48:45 CDT 2009

I rebooted to the new kernel and installed the world just after midnight
on the 20th. I started getting paged by Nagios again at 3:40am.

I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
as things began to circle the drain. That was about 30 minutes after
the dump attempt had been started by AMANDA. There were many processes
waiting in state D. This time I did a reboot -n -q and the box rebooted
but was still fscking when I got to the office.

# ls -l /home/.snap
-r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot

# df /home
Filesystem          Size  Used  Avail  Capacity  Mounted on
/dev/mirror/gm0s1g  106G   11G    86G       11%  /home

I removed userquota from the fstab entry for /home and rebooted, just
to be sure. The last danger combination I remember for snapshots was
in combination with quotas. Am I even in the danger zone for quotas
without having them compiled into the kernel?

It looks like removing the .snap directory should be enough to prevent
any future snapshots during the backup process. Does that sound like a
reasonable workaround? It would at least remove one variable from the
trouble shooting process.

Any other suggestions?

Thank you for any help you may be able to provide,

-- 
Scott Lambert KC5MLE Unix SysAdmin
lambert@lambertfam.org
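As an aside on the ls and df output in the post above: the byte count shown for dump_snapshot can be put on the same scale as the df figures. The sketch below uses only the numbers from the post; the helper name to_gib is made up for illustration, and it assumes the df "G" column is in units of 2^30 bytes. As far as is generally documented, a UFS snapshot file has an apparent size close to that of the whole filesystem and is sparse, so a large size in ls is probably not, by itself, evidence of a runaway snapshot.

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes). Hypothetical helper."""
    return n_bytes / 2**30

# Figures copied from the ls and df output in the post.
snapshot_bytes = 117285093376   # apparent size of /home/.snap/dump_snapshot
df_size_gib = 106               # df "Size" for /home (assumed to be GiB)
df_used_gib = 11                # df "Used" for /home (assumed to be GiB)

snapshot_gib = to_gib(snapshot_bytes)

print(f"snapshot apparent size: {snapshot_gib:.1f} GiB")
print(f"filesystem size per df: {df_size_gib} GiB, used: {df_used_gib} GiB")

# The apparent size is far larger than the space actually in use, which
# is only possible if the file is sparse (or df is stale).
print("apparent size exceeds used space:", snapshot_gib > df_used_gib)
```

The apparent size lands close to the size of the filesystem itself rather than the 11G in use, which fits the sparse-file reading.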
Scott Lambert
2009-Mar-22 02:32 UTC
Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
>
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
>
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable them. I have been running an
> unmodified GENERIC kernel config.
>
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>
> It runs a few jails, using ezjails. Two of them were image based jails,
> 1GB and 2GB. There is also one non-image file jail. The jails live in
> /home/ezjails.
>
> I added another image based jail, 3GB image, on March 12th.
>
> I added this machine to our AMANDA setup on March 13, 2009.
>
> Things seemed to be okay until the 19th. On the 19th, during the dump
> of /home, things gradually started to hang. Nagios paged me about
> services not responding.
>
> I did not find any explanation for it. The disks were idle according to
> systat -vm. I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
>
> I eventually had to go to the office and power cycle it. I tried C-A-D
> first, but shutdown timed out after 30 seconds.
>
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
>
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
>
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th. I started getting paged by Nagios again at 3:40am.
>
> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
> as things began to circle the drain. That was about 30 minutes after
> the dump attempt had been started by AMANDA. There were many processes
> waiting in state D. This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
>
> # ls -l /home/.snap
> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>
> # df /home
> Filesystem          Size  Used  Avail  Capacity  Mounted on
> /dev/mirror/gm0s1g  106G   11G    86G       11%  /home
>
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure. The last danger combination I remember for snapshots was
> in combination with quotas. Am I even in the danger zone for quotas
> without having them compiled into the kernel?
>
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process. Does that sound like a
> reasonable workaround? It would at least remove one variable from the
> trouble shooting process.
>
> Any other suggestions?
>
> Thank you for any help you may be able to provide,

Did it to me again tonight. I was unable to get in to look at anything.
Just pushed the power button. It did give me the same "shutdown timed
out after 30 seconds."

So, I tuned the /home fs to disable softupdates. I also removed the
.snap directory.

I would appreciate any suggestions...

-- 
Scott Lambert KC5MLE Unix SysAdmin
lambert@lambertfam.org
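Since the box could not be reached once the hang set in, one possible aid for the next occurrence is to record which processes are stuck in state D, and on what wait channel, while a shell still responds. The following is a minimal sketch in Python, not a tested monitoring tool. It assumes text in the column layout that `ps -axo pid,stat,wchan,command` prints; the function name blocked_processes and every row in SAMPLE are invented for illustration and are not output from the machine discussed in this thread.

```python
def blocked_processes(ps_output: str) -> list[dict]:
    """Return the rows whose STAT column starts with 'D' (disk wait).

    Expects a header line followed by rows of: PID STAT WCHAN COMMAND.
    """
    rows = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:                 # skip the header line
        parts = line.split(None, 3)        # command may contain spaces
        if len(parts) < 4:
            continue                       # ignore malformed rows
        pid, stat, wchan, command = parts
        if stat.startswith("D"):
            rows.append(
                {"pid": int(pid), "stat": stat,
                 "wchan": wchan, "command": command}
            )
    return rows


# Hypothetical sample input; none of these rows come from the real host.
SAMPLE = """\
  PID STAT WCHAN  COMMAND
  812 Ss   select /usr/sbin/syslogd -s
 4021 D    suspfs dump
 4022 D+   ufs    ls -l /home
 4023 R    -      mksnap_ffs
"""

result = blocked_processes(SAMPLE)
for row in result:
    print(row["pid"], row["stat"], row["wchan"], row["command"])
```

Appending such output to a log on a filesystem other than /home every minute or so during the backup window might show which wait channel the first blocked process sits on, which is the kind of detail that usually helps when asking for suggestions.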