I did a quick search but couldn't find anything about this little problem.

I have an X4100 production machine (called monster) with a J4200 full of 500GB drives attached. It's running OpenSolaris 2009.06 and is fully up to date.

It takes daily snapshots and sends them to another machine as a backup. The sending and receiving is scripted and run from a cron job. The problem is that some of the snapshots disappear from monster after they've been sent to the backup machine.

Example:

shane@monster:/$ zfs list -t snapshot | grep local@
...
mpool/local@zfs-auto-snap:daily-2009-07-02-00:00   64.5K  -   176K  -
mpool/local@zfs-auto-snap:daily-2009-07-03-00:00   76.5K  -   171K  -
mpool/local@zfs-auto-snap:daily-2009-07-05-00:00   59.8K  -   173K  -
mpool/local@zfs-auto-snap:daily-2009-07-06-00:00   59.8K  -   173K  -

shane@chucky[11:53:46]:/$ zfs list -t snapshot | grep local@
...
mpool/local@zfs-auto-snap:daily-2009-07-01-00:00   35K    -   92K   -
mpool/local@zfs-auto-snap:daily-2009-07-02-00:00   36K    -   93K   -
mpool/local@zfs-auto-snap:daily-2009-07-03-00:00   43.5K  -   89K   -
mpool/local@zfs-auto-snap:daily-2009-07-04-00:00   0      -   90K   -

As you can see, the snapshot for 2009-07-04 exists on chucky (the backup machine) but is gone from monster.

zpool history shows that the snapshot was taken:

shane@monster:/$ pfexec zpool history mpool | grep 2009-07-04
2009-07-04.00:00:02 zfs snapshot mpool/local@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:04 zfs snapshot -r mpool/local/VMwareMachines@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:05 zfs snapshot -r mpool/local/cvsroot@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:06 zfs snapshot -r mpool/projects@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-05.00:05:09 zfs destroy mpool/local@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-05.00:05:12 zfs destroy mpool/local/cvsroot@zfs-auto-snap:daily-2009-07-04-00:00

and the script did not produce any errors:

pfexec /usr/sbin/zfs send -I mpool/local@zfs-auto-snap:daily-2009-07-03-00:00 mpool/local@zfs-auto-snap:daily-2009-07-04-00:00 | ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

Any ideas?
DL Consulting wrote:
> [...]
> and the script did not produce any errors:
>
> pfexec /usr/sbin/zfs send -I mpool/local@zfs-auto-snap:daily-2009-07-03-00:00 mpool/local@zfs-auto-snap:daily-2009-07-04-00:00 | ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

Actually, you can't tell from this script whether an error has occurred, because you do not check the return value of zfs receive.

> Any ideas?

For some reason, the receive failed. Since a receive is an all-or-nothing event, the snapshot would not exist on the remote site. You must check the return codes.

But... your script should also sync from the last common snapshot, so it shouldn't matter if a transient event caused a disruption in the snapshot sequence. I have written such code; it isn't particularly hard, just a bit tedious.
 -- richard
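For illustration, a minimal sketch of the return-code check Richard recommends, assuming the cron job runs under bash; it is not his code, and the snapshot names are simply the ones from the example above:

#!/usr/bin/bash
# Sketch only: snapshot names are copied from the example in this thread.
FROM="mpool/local@zfs-auto-snap:daily-2009-07-03-00:00"
TO="mpool/local@zfs-auto-snap:daily-2009-07-04-00:00"

pfexec /usr/sbin/zfs send -I "$FROM" "$TO" | \
    ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

# Copy PIPESTATUS in a single statement; any later command overwrites it.
rc=( "${PIPESTATUS[@]}" )

# rc[0] is zfs send; rc[1] is ssh, which passes through the exit status
# of the remote zfs recv when the connection itself succeeds.
if [ "${rc[0]}" -ne 0 ] || [ "${rc[1]}" -ne 0 ]; then
    echo "send/recv failed: send=${rc[0]} recv=${rc[1]}" >&2
    exit 1
fi

If a single combined exit status is enough, "set -o pipefail" near the top of the script is an alternative to inspecting PIPESTATUS.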
Thanks. I'll fiddle things so it tells me what the return value is, and use the last common snapshot rather than the last received snapshot.
Just reread your response. If the send/recv fails, the snapshot should NOT turn up on chucky (the recv machine), right? However, it is turning up, yet the original on the sending machine is being destroyed by something (which I'm guessing is the time-slider-cleanup cron job below).

Here's the full crontab for root:

10 3 * * * /usr/sbin/logadm
15 3 * * 0 [ -x /usr/lib/fs/nfs/nfsfind ] && /usr/lib/fs/nfs/nfsfind
30 3 * * * [ -x /usr/lib/gss/gsscred_clean ] && /usr/lib/gss/gsscred_clean
30 0,9,12,18,21 * * * /usr/lib/update-manager/update-refresh.sh
5,20,35,50 * * * * /usr/lib/time-slider-cleanup -y

Do you have any suggestions as to why the snapshots are being destroyed, and why they're being destroyed after a gap of 1 hour 4 minutes (the delay between taking the snapshot and the start of the send/receive)? time-slider-cleanup would have run 4 times during that period.
DL Consulting <no-reply at opensolaris.org> writes:

> It takes daily snapshots and sends them to another machine as a backup. The sending and receiving is scripted and run from a cron job. The problem is that some of the snapshots disappear from monster after they've been sent to the backup machine.

Do not use the snapshots made for the time slider feature. These are under the control of the auto-snapshot service and exist precisely for the time slider, not for anything else. Snapshots are cheap; create your own for file system replication.

Since you always need to keep the last common snapshot on both the source and the target of the replication, you want snapshot creation and deletion under your own control, not under the control of a service that is made for something else.

For my own filesystem replication I have written a script that looks at the snapshots on the target side, locates the last one of those, and then does an incremental replication of a newly created snapshot relative to that last common one. The previous common one is destroyed only after the replication has succeeded, so the new snapshot becomes the last common one.

Once your replication gets out of sync, so that the last snapshot on the target is no longer the common one, you must delete snapshots on the target until the common one is the last one. If there is no common one any more, you have to start the replication from scratch: delete (or rename) the file system on the target and do a non-incremental send of a source snapshot to the target.

Regards, Juergen.
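For illustration, a rough bash sketch of the replication flow Juergen describes; the dataset names, the remote host, and the "repl-" snapshot prefix are placeholders for this example, not taken from his actual script:

#!/usr/bin/bash
# Rough sketch of the flow described above; names are hypothetical.
SRC="mpool/local"
DST="mpool/local"
REMOTE="shane@chucky"

# 1. Newest snapshot of $DST on the target; assumed to still exist on
#    the source, i.e. to be the last common snapshot.
LAST=$(ssh "$REMOTE" zfs list -H -o name -t snapshot -s creation | \
        grep "^$DST@" | tail -1 | cut -d@ -f2)
if [ -z "$LAST" ]; then
    echo "no snapshot on the target; a full (non-incremental) send is needed" >&2
    exit 1
fi

# 2. Create a fresh replication snapshot on the source.
NEW="repl-$(date '+%Y-%m-%d-%H%M%S')"
pfexec /usr/sbin/zfs snapshot "$SRC@$NEW" || exit 1

# 3. Incremental send from the last common snapshot to the new one.
pfexec /usr/sbin/zfs send -I "$SRC@$LAST" "$SRC@$NEW" | \
    ssh "$REMOTE" pfexec /usr/sbin/zfs recv "$DST"
rc=( "${PIPESTATUS[@]}" )
if [ "${rc[0]}" -ne 0 ] || [ "${rc[1]}" -ne 0 ]; then
    echo "replication failed: send=${rc[0]} recv=${rc[1]}" >&2
    exit 1
fi

# 4. Only after success: drop the previous common snapshot on the source,
#    so that $NEW is now the last common snapshot on both sides.
pfexec /usr/sbin/zfs destroy "$SRC@$LAST"

The important property is that the old snapshot is destroyed only after the send/recv has verifiably succeeded, so the two machines always share at least one common snapshot.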
DL Consulting wrote:
> Just reread your response. If the send/recv fails, the snapshot should NOT turn up on chucky (the recv machine), right? However, it is turning up, yet the original on the sending machine is being destroyed by something (which I'm guessing is the time-slider-cleanup cron job below).

Yes, but this is configurable via SMF. To see the properties:

svccfg -s auto-snapshot:daily listprop

and you can set them as desired:

svccfg -s auto-snapshot:daily setprop zfs/keep=62

For more info, see http://docs.sun.com/app/docs/doc/817-2271/gbcxl?a=view

> Here's the full crontab for root:
> [...]
> Do you have any suggestions as to why the snapshots are being destroyed, and why they're being destroyed after a gap of 1 hour 4 minutes (the delay between taking the snapshot and the start of the send/receive)? time-slider-cleanup would have run 4 times during that period.

Your script should match the policy you wish to implement.
 -- richard
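A small addition, not from Richard's message: after changing a property with svccfg setprop, the service generally needs a refresh before the running instance (and svcprop's default view) picks up the new value. Assuming the abbreviated FMRI above resolves to the ZFS auto-snapshot service:

# Push the repository change into the service's running snapshot,
# then verify the value now in effect.
svcadm refresh auto-snapshot:daily
svcprop -p zfs/keep auto-snapshot:daily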
Thanks, guys.
On Mon, 2009-07-06 at 10:00 +0200, Juergen Nickelsen wrote:
> DL Consulting <no-reply at opensolaris.org> writes:
> Do not use the snapshots made for the time slider feature. These are under the control of the auto-snapshot service and exist precisely for the time slider, not for anything else.

- or you could use the auto-snapshot:event SMF instance in 0.12 of the auto-snapshot service, where by default snapshots are not destroyed and are only taken when _you_ want them, not via cron.

[ or, as Richard suggests, simply set the snapshot expiry on the other instances to keep more snapshots; see the 'zfs/keep' SMF property ]

time-slider-cleanup is the thing that deletes snapshots, and it only does so if you're running low on disk space. The auto-snapshot service runs all of its cron jobs from the 'zfssnap' role.

cheers,
tim
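For reference, the cron jobs Tim mentions can be listed straight from the zfssnap role's crontab (this assumes sufficient privileges, hence pfexec, as elsewhere in this thread):

# Show the crontab of the zfssnap role rather than root's.
pfexec crontab -l zfssnap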