I did a quick search but couldn't find anything about this little problem.

I have an X4100 production machine (called monster) with a J4200 full of 500GB drives attached. It's running OpenSolaris 2009.06 and is fully up to date.

It takes daily snapshots and sends them to another machine as a backup. The sending and receiving is scripted and run from a cron job. The problem is that some of the snapshots disappear from monster after they've been sent to the backup machine.

Example:

shane@monster:/$ zfs list -t snapshot | grep local@
...
mpool/local@zfs-auto-snap:daily-2009-07-02-00:00   64.5K  -   176K  -
mpool/local@zfs-auto-snap:daily-2009-07-03-00:00   76.5K  -   171K  -
mpool/local@zfs-auto-snap:daily-2009-07-05-00:00   59.8K  -   173K  -
mpool/local@zfs-auto-snap:daily-2009-07-06-00:00   59.8K  -   173K  -

shane@chucky[11:53:46]:/$ zfs list -t snapshot | grep local@
...
mpool/local@zfs-auto-snap:daily-2009-07-01-00:00   35K    -   92K   -
mpool/local@zfs-auto-snap:daily-2009-07-02-00:00   36K    -   93K   -
mpool/local@zfs-auto-snap:daily-2009-07-03-00:00   43.5K  -   89K   -
mpool/local@zfs-auto-snap:daily-2009-07-04-00:00   0      -   90K   -

As you can see, the snapshot for 2009-07-04 exists on chucky (the backup machine) but is gone from monster.

zpool history shows that the snapshot was taken:

shane@monster:/$ pfexec zpool history mpool | grep 2009-07-04
2009-07-04.00:00:02 zfs snapshot mpool/local@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:04 zfs snapshot -r mpool/local/VMwareMachines@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:05 zfs snapshot -r mpool/local/cvsroot@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-04.00:00:06 zfs snapshot -r mpool/projects@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-05.00:05:09 zfs destroy mpool/local@zfs-auto-snap:daily-2009-07-04-00:00
2009-07-05.00:05:12 zfs destroy mpool/local/cvsroot@zfs-auto-snap:daily-2009-07-04-00:00

and the script did not produce any errors:

pfexec /usr/sbin/zfs send -I mpool/local@zfs-auto-snap:daily-2009-07-03-00:00 mpool/local@zfs-auto-snap:daily-2009-07-04-00:00 | ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

Any ideas?
DL Consulting wrote:
> [...]
> and the script did not produce any errors:
>
> pfexec /usr/sbin/zfs send -I mpool/local@zfs-auto-snap:daily-2009-07-03-00:00 mpool/local@zfs-auto-snap:daily-2009-07-04-00:00 | ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

Actually, you can't tell from this script whether an error has occurred, because you do not check the return value of zfs receive.

> Any ideas?

For some reason, the receive failed. Since a receive is an all-or-nothing event, the snapshot would not exist on the remote site. You must check the return codes.

But... your script should also sync from the last common snapshot, so it shouldn't matter if a transient event caused a disruption in the snapshot sequence. I have written such code; it isn't particularly hard, just a bit tedious.
 -- richard
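For illustration, a minimal sketch of the return-code check Richard recommends, assuming the cron job runs under bash; it is not his code, and the snapshot names are simply the ones from the example above:

#!/usr/bin/bash
# Sketch only: snapshot names are copied from the example in this thread.
FROM="mpool/local@zfs-auto-snap:daily-2009-07-03-00:00"
TO="mpool/local@zfs-auto-snap:daily-2009-07-04-00:00"

pfexec /usr/sbin/zfs send -I "$FROM" "$TO" | \
    ssh shane@chucky pfexec /usr/sbin/zfs recv mpool/local

# Copy PIPESTATUS in a single statement; any later command overwrites it.
rc=( "${PIPESTATUS[@]}" )

# rc[0] is zfs send; rc[1] is ssh, which passes through the exit status
# of the remote zfs recv when the connection itself succeeds.
if [ "${rc[0]}" -ne 0 ] || [ "${rc[1]}" -ne 0 ]; then
    echo "send/recv failed: send=${rc[0]} recv=${rc[1]}" >&2
    exit 1
fi

If a single combined exit status is enough, "set -o pipefail" near the top of the script is an alternative to inspecting PIPESTATUS.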
Thanks. I'll fiddle things so it tells me what the return value is, and use the last common snapshot rather than the last received snapshot.
Just reread your response. If the send/recv fails, the snapshot should NOT turn up on chucky (the recv machine), right? However, it is turning up, yet the original on the sending machine is being destroyed by something (which I'm guessing is the time-slider-cleanup cron job below).

Here's the full crontab for root:

10 3 * * * /usr/sbin/logadm
15 3 * * 0 [ -x /usr/lib/fs/nfs/nfsfind ] && /usr/lib/fs/nfs/nfsfind
30 3 * * * [ -x /usr/lib/gss/gsscred_clean ] && /usr/lib/gss/gsscred_clean
30 0,9,12,18,21 * * * /usr/lib/update-manager/update-refresh.sh
5,20,35,50 * * * * /usr/lib/time-slider-cleanup -y

Do you have any suggestions as to why the snapshots are being destroyed, and why they're being destroyed after a gap of 1 hour 4 minutes (the delay between taking the snapshot and the start of the send/receive)? time-slider-cleanup would have run 4 times during that period.
DL Consulting <no-reply at opensolaris.org> writes:

> It takes daily snapshots and sends them to another machine as a backup. The sending and receiving is scripted and run from a cron job. The problem is that some of the snapshots disappear from monster after they've been sent to the backup machine.

Do not use the snapshots made for the time slider feature. These are under the control of the auto-snapshot service and exist precisely for the time slider, not for anything else. Snapshots are cheap; create your own for file system replication.

Since you always need to keep the last common snapshot on both the source and the target of the replication, you want snapshot creation and deletion under your own control, not under the control of a service that is made for something else.

For my own filesystem replication I have written a script that looks at the snapshots on the target side, locates the last one of those, and then does an incremental replication of a newly created snapshot relative to that last common one. The previous common one is destroyed only after the replication has succeeded, so the new snapshot becomes the last common one.

Once your replication gets out of sync, so that the last snapshot on the target is no longer the common one, you must delete snapshots on the target until the common one is the last one. If there is no common one any more, you have to start the replication from scratch: delete (or rename) the file system on the target and do a non-incremental send of a source snapshot to the target.

Regards, Juergen.
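For illustration, a rough bash sketch of the replication flow Juergen describes; the dataset names, the remote host, and the "repl-" snapshot prefix are placeholders for this example, not taken from his actual script:

#!/usr/bin/bash
# Rough sketch of the flow described above; names are hypothetical.
SRC="mpool/local"
DST="mpool/local"
REMOTE="shane@chucky"

# 1. Newest snapshot of $DST on the target; assumed to still exist on
#    the source, i.e. to be the last common snapshot.
LAST=$(ssh "$REMOTE" zfs list -H -o name -t snapshot -s creation | \
        grep "^$DST@" | tail -1 | cut -d@ -f2)
if [ -z "$LAST" ]; then
    echo "no snapshot on the target; a full (non-incremental) send is needed" >&2
    exit 1
fi

# 2. Create a fresh replication snapshot on the source.
NEW="repl-$(date '+%Y-%m-%d-%H%M%S')"
pfexec /usr/sbin/zfs snapshot "$SRC@$NEW" || exit 1

# 3. Incremental send from the last common snapshot to the new one.
pfexec /usr/sbin/zfs send -I "$SRC@$LAST" "$SRC@$NEW" | \
    ssh "$REMOTE" pfexec /usr/sbin/zfs recv "$DST"
rc=( "${PIPESTATUS[@]}" )
if [ "${rc[0]}" -ne 0 ] || [ "${rc[1]}" -ne 0 ]; then
    echo "replication failed: send=${rc[0]} recv=${rc[1]}" >&2
    exit 1
fi

# 4. Only after success: drop the previous common snapshot on the source,
#    so that $NEW is now the last common snapshot on both sides.
pfexec /usr/sbin/zfs destroy "$SRC@$LAST"

The important property is that the old snapshot is destroyed only after the send/recv has verifiably succeeded, so the two machines always share at least one common snapshot.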
DL Consulting wrote:
> Just reread your response. If the send/recv fails, the snapshot should NOT turn up on chucky (the recv machine), right? However, it is turning up, yet the original on the sending machine is being destroyed by something (which I'm guessing is the time-slider-cleanup cron job below).

Yes, but this is configurable via SMF. To see the properties:

svccfg -s auto-snapshot:daily listprop

and you can set them as desired:

svccfg -s auto-snapshot:daily setprop zfs/keep=62

For more info, see http://docs.sun.com/app/docs/doc/817-2271/gbcxl?a=view

> Here's the full crontab for root:
> [...]
> Do you have any suggestions as to why the snapshots are being destroyed, and why they're being destroyed after a gap of 1 hour 4 minutes (the delay between taking the snapshot and the start of the send/receive)? time-slider-cleanup would have run 4 times during that period.

Your script should match the policy you wish to implement.
 -- richard
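A small addition, not from Richard's message: after changing a property with svccfg setprop, the service generally needs a refresh before the running instance (and svcprop's default view) picks up the new value. Assuming the abbreviated FMRI above resolves to the ZFS auto-snapshot service:

# Push the repository change into the service's running snapshot,
# then verify the value now in effect.
svcadm refresh auto-snapshot:daily
svcprop -p zfs/keep auto-snapshot:daily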
Thanks, guys.
On Mon, 2009-07-06 at 10:00 +0200, Juergen Nickelsen wrote:
> DL Consulting <no-reply at opensolaris.org> writes:
> Do not use the snapshots made for the time slider feature. These are under the control of the auto-snapshot service and exist precisely for the time slider, not for anything else.

- or you could use the auto-snapshot:event SMF instance in 0.12 of the auto-snapshot service, where by default snapshots are not destroyed and are only taken when _you_ want them, not via cron.

[ or, as Richard suggests, simply set the snapshot expiry on the other instances to keep more snapshots; see the 'zfs/keep' SMF property ]

time-slider-cleanup is the thing that deletes snapshots, and it only does so if you're running low on disk space. The auto-snapshot service runs all of its cron jobs from the 'zfssnap' role.

cheers,
tim
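For reference, the cron jobs Tim mentions can be listed straight from the zfssnap role's crontab (this assumes sufficient privileges, hence pfexec, as elsewhere in this thread):

# Show the crontab of the zfssnap role rather than root's.
pfexec crontab -l zfssnap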