thr3ads.net - freebsd stable - Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful? [Mar 2009]

Scott Lambert

2009-Mar-20 13:09 UTC

Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?

I have a previously stable machine, other than a one time panic in
soft-updates which I could never reproduce, running RELENG_7 from July
23, 2008.

Starting update: Wed Jul 23 01:29:47 CDT 2008
Finished update: Wed Jul 23 01:31:13 CDT 2008

I had the userquota option in the fstab for /home, but I did not yet
have anything in /etc/rc.conf to enable them.  I have been running an
unmodified GENERIC kernel config.

/dev/mirror/gm0s1g on /home (ufs, local, soft-updates)

It runs a few jails, using ezjail.  Two of them were image-based jails,
1GB and 2GB.  There is also one non-image file jail.  The jails live in
/home/ezjails.

I added another image based jail, 3GB image, on March 12th.

I added this machine to our AMANDA setup on March 13, 2009.  

Things seemed to be okay until the 19th.  On the 19th, during the dump
of /home, things gradually started to hang.  Nagios paged me about
services not responding.  

I did not find any explanation for it.  The disks were idle according to
systat -vm.  I was able to grep the log files on /var for a while, and
then I could no longer do anything with it.

I eventually had to go to the office and power cycle it.  I tried C-A-D
first, but shutdown timed out after 30 seconds.

Just to make sure it wasn't something that had since been fixed, I
updated to RELENG_7 as of Mar 19th.

Starting update: Thu Mar 19 03:40:41 CDT 2009
Finished update: Thu Mar 19 03:48:45 CDT 2009

I rebooted to the new kernel and installed the world just after midnight
on the 20th.  I started getting paged by Nagios again at 3:40am.  

I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
as things began to circle the drain.  That was about 30 minutes after
the dump attempt had been started by AMANDA.  There were many processes
waiting in state D.  This time I did a reboot -n -q and the box rebooted
but was still fscking when I got to the office.

# ls -l /home/.snap
-r--------   1 root  operator  117285093376 Mar 20 03:18 dump_snapshot

# df /home
Filesystem            Size    Used   Avail Capacity  Mounted on
/dev/mirror/gm0s1g    106G     11G     86G    11%    /home
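
For scale, the size shown for dump_snapshot can be set against the df
figures.  The sketch below uses only the byte count from the ls -l
listing above.  The reading of the result is an assumption, not
something the listings prove: a UFS snapshot file is normally sparse and
reports an apparent size close to that of the whole file system, so the
large number need not mean that much space is in use.

```shell
#!/bin/sh
# Apparent size of /home/.snap/dump_snapshot, copied from the ls -l output.
snap_bytes=117285093376

# Convert bytes to whole GiB (shell integer division truncates).
snap_gib=$((snap_bytes / 1024 / 1024 / 1024))

echo "snapshot apparent size: ${snap_gib} GiB"
```

That is close to the 106G that df reports for /home, which would fit the
sparse-file reading.  Running du on the snapshot file would show the
space it actually occupies.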

I removed userquota from the fstab entry for /home and rebooted, just
to be sure.  The last danger combination I remember for snapshots was
in combination with quotas.  Am I even in the danger zone for quotas
without having them compiled into the kernel?

It looks like removing the .snap directory should be enough to prevent
any future snapshots during the backup process.  Does that sound like a
reasonable workaround?  It would at least remove one variable from the
troubleshooting process.

Any other suggestions?

Thank you for any help you may be able to provide,

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org

Scott Lambert

2009-Mar-22 02:32 UTC


Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?

On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
> 
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
> 
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable them.  I have been running an
> unmodified GENERIC kernel config.
> 
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
> 
> It runs a few jails, using ezjail.  Two of them were image-based jails,
> 1GB and 2GB.  There is also one non-image file jail.  The jails live in
> /home/ezjails.
> 
> I added another image based jail, 3GB image, on March 12th.
> 
> I added this machine to our AMANDA setup on March 13, 2009.  
> 
> Things seemed to be okay until the 19th.  On the 19th, during the dump
> of /home, things gradually started to hang.  Nagios paged me about
> services not responding.  
> 
> I did not find any explanation for it.  The disks were idle according to
> systat -vm.  I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
> 
> I eventually had to go to the office and power cycle it.  I tried C-A-D
> first, but shutdown timed out after 30 seconds.
> 
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
> 
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
> 
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th.  I started getting paged by Nagios again at 3:40am.  
> 
> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
> as things began to circle the drain.  That was about 30 minutes after
> the dump attempt had been started by AMANDA.  There were many processes
> waiting in state D.  This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
> 
> # ls -l /home/.snap
> -r--------   1 root  operator  117285093376 Mar 20 03:18 dump_snapshot
> 
> # df /home
> Filesystem            Size    Used   Avail Capacity  Mounted on
> /dev/mirror/gm0s1g    106G     11G     86G    11%    /home
> 
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure.  The last danger combination I remember for snapshots was
> in combination with quotas.  Am I even in the danger zone for quotas
> without having them compiled into the kernel?
> 
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process.  Does that sound like a
> reasonable workaround?  It would at least remove one variable from the
> troubleshooting process.
> 
> Any other suggestions?
> 
> Thank you for any help you may be able to provide,

Did it to me again tonight.  I was unable to get in to look at anything.
Just pushed the power button.  It did give me the same "shutdown timed
out after 30 seconds."

So, I tuned the /home fs to disable soft-updates.  I also removed the
.snap directory.
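
The two changes could be sketched roughly as follows.  This is an
assumption about the steps, not a record of the commands actually run:
it supposes tunefs(8) was the tool used, takes the device and mount
point from the mount output quoted above, and wraps everything in a
hypothetical plan function that only prints the commands, since they
need root on FreeBSD and two of them delete things.

```shell
#!/bin/sh
# Values taken from the mount output in the post.
FS=/home
DEV=/dev/mirror/gm0s1g

# Print the steps instead of running them; review before pasting.
plan() {
    echo "umount ${FS}"                     # tunefs wants the fs unmounted or read-only
    echo "tunefs -n disable ${DEV}"         # -n enable|disable toggles soft updates
    echo "mount ${FS}"
    echo "rm -f ${FS}/.snap/dump_snapshot"  # remove the leftover snapshot file
    echo "rmdir ${FS}/.snap"                # then the now-empty .snap directory
}

plan
```

Because the jails live under /home, they would have to be stopped
before the umount step.  As far as dump(8) describes it, dump -L warns
and falls back to a non-snapshot dump when .snap is missing, which is
what makes removing the directory a plausible workaround.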

I would appreciate any suggestions...
 
-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org
