Hi folks,

I recently read up on Scott Dickson's blog with his solution for jumpstart/flashless cloning of ZFS root filesystem boxes. I have to say that it initially looks to work out cleanly, but of course there are kinks to be worked out that deal with auto mounting filesystems mostly.

The issue that I'm having is that a few days after these cloned systems are brought up and reconfigured they are crashing and svc.configd refuses to start.

I thought about using zpool scrub <poolname> right after completing the stream as an integrity check.

If you have any suggestions about this I'd love to hear them!

Thanks,
Christopher Mera
I don't know what's causing this, nor have I seen it. Can you send more information about the errors you see when the system crashes and svc.configd fails?

Doing the scrub seems like a harmless and possibly useful thing to do. Let us know what you find out from it.

Lori

On 02/23/09 11:05, Christopher Mera wrote:
> The issue that I'm having is that a few days after these cloned
> systems are brought up and reconfigured they are crashing and
> svc.configd refuses to start.
>
> I thought about using zpool scrub <poolname> right after completing
> the stream as an integrity check.
Forgive the double posts; they will cease immediately.
 
panic[cpu0]/thread=dacac880: BAD TRAP: type=e (#pf Page fault)
rp=d9f61850 addr=1048c0d occurred in module "zfs" due to an illegal
access to a user address
 
net-init: #pf Page fault
Bad kernel fault at addr=0x1048c0d
pid=1069, pc=0xfebab410, sp=0xd1c38018, eflags=0x10296
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6b8<xmme,fxsr,pge,pae,pse,de>
cr2: 1048c0d cr3: 7965020
         gs: e7b601b0  fs: fec90000  es:      160  ds:      160
        edi:        0 esi:  1048bf9 ebp: d9f618c4 esp: d9f61888
        ebx:  1048bf9 edx:  1048bfd ecx: dc311900 eax: d9f6192c
        trp:        e err:        0 eip: febab410  cs:      158
        efl:    10296 usp: d1c38018  ss: e43feb72
 
d9f6178c unix:die+93 (e, d9f61850, 1048c0)
d9f6183c unix:trap+1422 (d9f61850, 1048c0d, )
d9f61850 unix:cmntrap+7c (e7b601b0, fec90000,)
d9f618c4 zfs:mze_compare+18 (d9f6192c, 1048bf9, )
d9f61904 genunix:avl_find+39 (d34b2958, d9f6192c,)
d9f619a4 zfs:mze_find+4a (e45fb8c0, d9f61c9c,)
d9f619e4 zfs:zap_lookup_norm+65 (dc2665a8, 21d, 0, d)
d9f61a34 zfs:zap_lookup+31 (dc2665a8, 21d, 0, d)
d9f61a94 zfs:zfs_match_find+ba (dc8fb980, e0ed3460,)
d9f61b04 zfs:zfs_dirent_lock+358 (d9f61b38, e0ed3460,)
d9f61b54 zfs:zfs_dirlook+f7 (e0ed3460, d9f61c9c,)
d9f61ba4 zfs:zfs_lookup+d5 (e0b420c0, d9f61c9c,)
d9f61c04 genunix:fop_lookup+b0 (e0b420c0, d9f61c9c,)
d9f61dc4 genunix:lookuppnvp+3e4 (d9f61e3c, 0, 1, 0, )
d9f61e14 genunix:lookuppnat+f3 (d9f61e3c, 0, 1, 0, )
d9f61e94 genunix:lookupnameat+52 (807b51c, 0, 1, 0, d)
d9f61ef4 genunix:cstatat_getvp+15d (ffd19553, 807b51c, )
d9f61f54 genunix:cstatat64+68 (ffd19553, 807b51c, )
d9f61f84 genunix:stat64+1c (807b51c, 8047b50, 8)
 
From: Lori.Alt at Sun.COM [mailto:Lori.Alt at Sun.COM] 
Sent: Monday, February 23, 2009 1:17 PM
To: Christopher Mera
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption
 
I don't know what's causing this, nor have I seen it.
Can you send more information about the errors you
see when the system crashes and svc.configd fails?
Doing the scrub seems like a harmless and possibly
useful thing to do.  Let us know what you find out
from it.
Lori
On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
> The issue that I'm having is that a few days after these cloned systems
> are brought up and reconfigured they are crashing and svc.configd
> refuses to start.

When you snapshot a ZFS filesystem you get just that -- a snapshot at the filesystem level. That does not mean you get a snapshot at the _application_ level. Now, svc.configd is a daemon that keeps a SQLite2 database. If you snapshot the filesystem in the middle of a SQLite2 transaction you won't get the behavior that you want.

In other words: quiesce your system before you snapshot its root filesystem for the purpose of replicating that root on other systems.

Nico
On Tue, Feb 24, 2009 at 19:18, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> When you snapshot a ZFS filesystem you get just that -- a snapshot at
> the filesystem level. That does not mean you get a snapshot at the
> _application_ level. Now, svc.configd is a daemon that keeps a SQLite2
> database. If you snapshot the filesystem in the middle of a SQLite2
> transaction you won't get the behavior that you want.
>
> In other words: quiesce your system before you snapshot its root
> filesystem for the purpose of replicating that root on other systems.

That would be a bug in ZFS or SQLite2.

A snapshot should be an atomic operation. The effect should be the same as a power failure in the middle of a transaction, and decent databases can cope with that.
Either way, it would be ideal to quiesce the system before a snapshot anyway, no?

My next question now is what particular steps would be recommended to quiesce a system for the clone/zfs stream that I'm looking to achieve...

All your help is appreciated.

Regards,
Christopher Mera

-----Original Message-----
From: Mattias Pantzare [mailto:pantzare at gmail.com]
Sent: Tuesday, February 24, 2009 1:38 PM
To: Nicolas Williams
Cc: Christopher Mera; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

That would be a bug in ZFS or SQLite2.

A snapshot should be an atomic operation. The effect should be the same as a power failure in the middle of a transaction, and decent databases can cope with that.
On Tue, Feb 24, 2009 at 10:41 AM, Christopher Mera <cmera at reliantsec.net> wrote:
> Either way, it would be ideal to quiesce the system before a snapshot anyway, no?
>
> My next question now is what particular steps would be recommended to
> quiesce a system for the clone/zfs stream that I'm looking to achieve...

If you are writing a script to handle ZFS snapshots/backups, you could issue an SMF command to stop the service before taking the snapshot. Or at the very minimum, perform an SQL dump of the DB so you at least have a consistent full copy of the DB as a flat file in case you can't stop the DB service.

--
Brent Jones
brent at servuhome.net
On Tue, Feb 24, 2009 at 07:37:39PM +0100, Mattias Pantzare wrote:
> On Tue, Feb 24, 2009 at 19:18, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > In other words: quiesce your system before you snapshot its root
> > filesystem for the purpose of replicating that root on other systems.
>
> That would be a bug in ZFS or SQLite2.

I suspect it's actually a bug in svc.configd.

Nico
On Tue, Feb 24, 2009 at 10:56:45AM -0800, Brent Jones wrote:
> If you are writing a script to handle ZFS snapshots/backups, you could
> issue an SMF command to stop the service before taking the snapshot.
> Or at the very minimum, perform an SQL dump of the DB so you at least
> have a consistent full copy of the DB as a flat file in case you can't
> stop the DB service.

I don't think there's any way to ask svc.configd to pause.
Thanks for your responses.

Brent:
And I'd have to do that for every system that I'd want to clone? There must be a simpler way; perhaps I'm missing something.

Regards,
Chris
On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera <cmera at reliantsec.net> wrote:
> Brent:
> And I'd have to do that for every system that I'd want to clone? There
> must be a simpler way.. perhaps I'm missing something.

Well, unless the database software itself can "notice" a snapshot taking place, flush all data to disk, pause transactions until the snapshot is finished, and then properly resume, I don't know what to tell you.

It's an issue for all databases (Oracle, MSSQL, MySQL...): how to do an atomic backup without stopping transactions while maintaining consistency.

Replication is one possible solution, dumping to a file periodically is another, or you can just tolerate that your database will not be consistent after a snapshot and will have to replay logs or run a consistency check when it is brought up from that snapshot.

Once you figure that out in a filesystem-agnostic way, you'll be a wealthy person indeed.

--
Brent Jones
brent at servuhome.net
>>>>> "cm" == Christopher Mera <cmera at reliantsec.net> writes:

    cm> it would be ideal to quiesce the system before a snapshot
    cm> anyway, no?

It would be more ideal to find the bug in SQLite2 or ZFS. Training everyone, ``you always have to quiesce the system before proceeding, because it's full of bugs'' is retarded MS-DOS behavior. I think it is actually harmful.
>>>>> "bj" == Brent Jones <brent at servuhome.net> writes:

    bj> tolerating that your database will not be consistent after a
    bj> snapshot and have to replay logs / consistency check it

``not be consistent'' != ``have to replay logs''
How is it that flash archives can avoid these headaches?

Ultimately I'm doing this to clone ZFS root systems because at the moment Flash Archives are UFS only.

-----Original Message-----
From: Brent Jones [mailto:brent at servuhome.net]
Sent: Tuesday, February 24, 2009 2:49 PM
To: Christopher Mera
Cc: Mattias Pantzare; Nicolas Williams; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

Well, unless the database software itself can "notice" a snapshot taking place, and flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you.

It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup, without stopping transactions, and maintaining consistency.
On Tue, Feb 24, 2009 at 02:53:14PM -0500, Miles Nordin wrote:
> >>>>> "cm" == Christopher Mera <cmera at reliantsec.net> writes:
>
>     cm> it would be ideal to quiesce the system before a snapshot
>     cm> anyway, no?
>
> It would be more ideal to find the bug in SQLite2 or ZFS.

It's NOT a bug in ZFS. It might be a bug in SQLite2, it might be a bug in svc.configd. More information would help; specifically: error/log messages from svc.configd, and /etc/svc/repository.db.
On 02/24/09 12:57, Christopher Mera wrote:
> How is it that flash archives can avoid these headaches?

Are we sure that they do avoid this headache? A flash archive (on UFS root) is created by doing a cpio of the root file system. Could a cpio end up archiving a file that was mid-way through an SQLite2 transaction?

Lori

> Ultimately I'm doing this to clone ZFS root systems because at the
> moment Flash Archives are UFS only.
On Tue, Feb 24, 2009 at 01:17:47PM -0600, Nicolas Williams wrote:
> I don't think there's any way to ask svc.configd to pause.

Well, IIRC that's not quite right. You can pstop svc.startd, gently kill (i.e., not with SIGKILL) svc.configd, take your snapshot, then prun svc.startd.

Nico
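The pstop / kill / snapshot / prun sequence can be written down as an ordered command plan. The sketch below is only an illustration under stated assumptions: the helper name `quiesce_plan`, the PIDs, the snapshot name, the choice of SIGTERM as the "gentle" signal, and the `-r` flag on `zfs snapshot` are not from the thread. It only builds the argument lists; running them on a Solaris host is left to the caller.

```python
def quiesce_plan(startd_pid, configd_pid, snapshot):
    """Return the commands, in order, for the pstop/kill/snapshot/prun sequence.

    All three arguments are hypothetical inputs: the PIDs would come from
    something like `pgrep -x svc.startd`, and `snapshot` is a name such
    as "rpool@clone".
    """
    return [
        # Stop svc.startd first (presumably so it does not restart
        # svc.configd while the snapshot is being taken; that reading
        # is an assumption).
        ["pstop", str(startd_pid)],
        # "Gently" terminate svc.configd: SIGTERM is assumed, never SIGKILL.
        ["kill", "-TERM", str(configd_pid)],
        # Take the snapshot while nothing is writing the repository.
        ["zfs", "snapshot", "-r", snapshot],
        # Resume svc.startd.
        ["prun", str(startd_pid)],
    ]


if __name__ == "__main__":
    # Print the plan instead of executing it.
    for argv in quiesce_plan(7, 9, "rpool@clone"):
        print(" ".join(argv))
```

Keeping the plan as data makes it easy to review or log before anything is executed, which seems prudent for a procedure that suspends a core system daemon.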
Here's what makes me say that:

There are over 700 boxes deployed using Flash Archives on an S10 system with a UFS root. We've been working on basing our platform on a ZFS root and took Scott Dickson's suggestions (http://blogs.sun.com/scottdickson/entry/flashless_system_cloning_with_zfs) for doing a System Clone. The process worked out well: the system came up and looked stable until, 24 hours later, kernel panics became incessant and svc.configd would no longer load its repository.

Hope that explains where I'm coming from.

Regards,
Chris

From: Lori.Alt at Sun.COM [mailto:Lori.Alt at Sun.COM]
Sent: Tuesday, February 24, 2009 3:13 PM
To: Christopher Mera
Cc: Brent Jones; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

Are we sure that they do avoid this headache? A flash archive (on ufs root) is created by doing a cpio of the root file system. Could a cpio end up archiving a file that was mid-way through an SQLite2 transaction?

Lori
On Tue, Feb 24, 2009 at 2:15 PM, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> Well, IIRC that's not quite right. You can pstop svc.startd, gently
> kill (i.e., not with SIGKILL) svc.configd, take your snapshot, then prun
> svc.startd.

Hot Backup?

    # Connect to the database
    sqlite3 db $dbfile
    # Lock the database, copy and commit or rollback
    if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} res]} {
        puts "Backup failed: $res"
    } else {
        puts "Backup succeeded"
    }
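For readers who do not use Tcl, the same lock-then-copy idea can be sketched in Python. This is a minimal sketch, assuming a SQLite 3 database in the default rollback-journal mode, which is what Python's `sqlite3` module talks to; the names `hot_backup` and `demo` and the file paths are invented for illustration, and nothing here is claimed to work against an SQLite2 file.

```python
import os
import shutil
import sqlite3
import tempfile


def hot_backup(db_path, backup_path):
    """Copy a SQLite 3 database file while holding a write lock on it."""
    # isolation_level=None means autocommit mode, so the explicit
    # BEGIN/ROLLBACK below are the only transaction control in play.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the write lock up front: other connections
        # may still read, but none can start writing until we are done.
        conn.execute("BEGIN IMMEDIATE")
        try:
            shutil.copyfile(db_path, backup_path)
        finally:
            # Nothing was modified, so ROLLBACK simply releases the lock.
            conn.execute("ROLLBACK")
    finally:
        conn.close()


def demo():
    """Build a throwaway database, back it up, and count rows in the copy."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "example.db")
    dst = src + ".bak"
    conn = sqlite3.connect(src)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    conn.commit()
    conn.close()
    hot_backup(src, dst)
    copy = sqlite3.connect(dst)
    try:
        return copy.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        copy.close()
```

Unlike the Tcl version, this sketch always rolls back rather than committing, since the transaction exists only to hold the lock during the copy.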
On Tue, Feb 24, 2009 at 02:27:18PM -0600, Tim wrote:
> Hot Backup?
>
> # Connect to the database
> sqlite3 db $dbfile
> # Lock the database, copy and commit or rollback
> if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} res]} {
>     puts "Backup failed: $res"
> } else {
>     puts "Backup succeeded"
> }

SMF uses SQLite2. Sorry.
On Tue, Feb 24, 2009 at 12:19:22PM -0800, Christopher Mera wrote:
> There are over 700 boxes deployed using Flash Archive's on an S10 system
> with a UFS root. We've been working on basing our platform on a ZFS
> root and took Scott Dickson's suggestions
> (http://blogs.sun.com/scottdickson/entry/flashless_system_cloning_with_zfs)
> for doing a System Clone. The process worked out well, the system
> came up and looked stable until 24 hours later kernel panic's became
> incessant and svc.configd won't load its repository any longer.

OK, svc.configd cannot cause a panic, so perhaps there is a ZFS bug.
>>>>> "la" == Lori Alt <Lori.Alt at Sun.COM> writes:

    la> Could a cpio end up archiving a file that was mid-way
    la> through an SQLite2 transaction?

cpio is actually much worse for a database than a snapshot! I don't know what's going on in this specific case, but the cpio backup is worse for SQLite-using things like Thunderbird than a snapshot backup.

It's ok if your backup is equivalent to this, and snapshot backups are equivalent:

 * yank the cord.
 * boot up, but do NOT start SQLite2.
 * copy SQLite2's files somewhere else.
 * later, feed the copied files to SQLite2, and say ``recover, as if power failed.''

SQLite2 should be able to do this ``recover'' step speedily and without ``corruption'' or ``inconsistency,'' and without any ``half completed'' transactions. The fact that databases have transactions is not something that makes them vulnerable to cord-yanking or corrupt from snapshot backups. About 1/4 of the reason databases even have something *called* a Transaction is to support *exactly* this scenario.

What's not workable is to back up the file storing the database gradually while the database is writing to it, so the backed-up blocks near the start of the file are older than blocks near the end. cpio backups on live filesystems are like your backup is a wand sweeping through the file's space, while at the same time SQLite2 writes are dipping into the file sometimes before the wand, sometimes behind. Any writes SQLite2 does to offsets behind the wand are lost, while writes in front of the wand are captured into the backup. This will cause corruption. It's not the same as a cord-yank and not speedily recoverable.

The way I try to back up UFS systems is to take a snapshot with fssnap, then back up the snapshot with ufsdump. You could also UFS-mount the fssnap device somewhere read-only and use cpio on that mountpoint instead of ufsdump on the device; that's safe too, modulo bugs in SQLite2 and SMF.
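The sweeping-wand failure described above can be reproduced with a toy model. The sketch below is an illustration only: the "database" is a four-element list of blocks, and the function names and the write schedule are made up. It contrasts a block-by-block copy that races with writes against a copy taken at a single instant.

```python
def sweeping_copy(live, writes):
    """Copy `live` one block at a time while writes land in between.

    `writes` maps a copy step to a list of (index, value) updates that
    the "database" applies just before that block is copied.
    """
    backup = []
    for step in range(len(live)):
        for index, value in writes.get(step, []):
            live[index] = value
        backup.append(live[step])
    return backup


def snapshot_copy(live):
    """Copy every block at one instant, like a filesystem snapshot."""
    return list(live)


def demo():
    # Four blocks, all belonging to "version A" of the database.
    live = ["A", "A", "A", "A"]
    # After blocks 0 and 1 have been copied, a single transaction
    # rewrites block 0 (behind the wand) and block 3 (ahead of it).
    writes = {2: [(0, "B"), (3, "B")]}
    torn = sweeping_copy(live, writes)
    return torn, snapshot_copy(live)
```

In this model the sweeping copy ends up with the new block 3 but the old block 0, a state the database was never in at any single moment, while the instantaneous copy sees the transaction either entirely or not at all.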
But backing up the writeable filesystem with cpio is never safe for SQLite2 or Berkeley DB or any real database.

Older systems had no fssnap and no 'zfs snap', so it was impossible to do backups by performing the cord-yank-simulation procedure above. Most Linux systems still can't do it. You need operating system support to do it, so if you don't have it, whether you're cpio or you're an ``enterprise backup solution,'' you need some help from the database to do a live backup.

When databases have some mode to support backups, usually what they do is to make two kinds of promises:

 (1) certain files, I will not write to them at all until you take me out of backup-mode. Pass your backup wands through them all you want. I'll not be changing them.

 (2) other files, I will only append to them. I will never write to the middle.

Both behaviors are wand-safe, so you can use userspace-only cpio backups without shutting the database all the way down.

You do *NOT* need to use the (1) (2) helper-mode to do a snapshot backup. If your database can't handle a snapshot backup unless you put it into remedial backup-assistance (1) (2) mode first, then your database can't handle cord-yanking either, and is BROKEN.

The observed problem doesn't mean SQLite2 is broken. It's possible the software above SQLite2 is not using the transactions aggressively enough. For example suppose SMF craps its pants if it ever boots up to find database-stored switches 1 and 2 are not set to the same value. If SMF is commanding SQLite2 to:

 * Transaction 1: flip switch 1 to B
 * Transaction 2: flip switch 2 to B

then it could have trouble surviving cord-yanking or backups, and it'll have trouble no matter whether it's a cord-yank or a snapshot backup or a sweeping-wand backup, and no matter if you somehow put SQLite2 in backup-friendly mode first or not.
The proper way is for SMF to tell SQLite2:

 * Transaction 1: flip switch 1 to B
                  flip switch 2 to B

SQLite2 will then guarantee that both happen, or neither happens, but only if you ask it to by putting both in one transaction. The whole *point* of using SQLite2 in your SMF project is to arrange for such guarantees as these to be kept during backups and cord-yanks. But a database cannot magically make the system appear to run continuously: SMF still needs to specify to SQLite2 what ``consistency'' means before the database can guarantee it.

Hope this helps untangle some FUD. Snapshot backups of databases *are* safe, unless the database or application above it is broken in a way that makes cord-yanking unsafe too.
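The one-transaction-versus-two point can be demonstrated with Python's built-in `sqlite3` module. Note the assumptions: this is SQLite 3 rather than the SQLite2 that SMF uses, and the table, the switch names, and the simulated failure are invented for illustration. An explicit ROLLBACK stands in for the recovery that would discard uncommitted work after a crash.

```python
import sqlite3


def make_db():
    """In-memory database with two switches, both set to 'A'."""
    # Autocommit mode: the only transactions are the ones we BEGIN below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE switches (name TEXT PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO switches VALUES (?, ?)",
                     [("s1", "A"), ("s2", "A")])
    return conn


def flip(conn, name):
    """Set one switch to 'B'."""
    conn.execute("UPDATE switches SET value = 'B' WHERE name = ?", (name,))


def state(conn):
    """Return the switches as a {name: value} dict."""
    return dict(conn.execute("SELECT name, value FROM switches ORDER BY name"))


def two_transactions_interrupted():
    """Flip each switch in its own transaction; 'crash' between the two."""
    conn = make_db()
    conn.execute("BEGIN")
    flip(conn, "s1")
    conn.execute("COMMIT")
    # Simulated cord-yank: the second transaction never runs.
    return state(conn)


def one_transaction_interrupted():
    """Flip both switches in one transaction; 'crash' before COMMIT."""
    conn = make_db()
    conn.execute("BEGIN")
    flip(conn, "s1")
    # Simulated cord-yank before the second flip and the COMMIT.
    conn.execute("ROLLBACK")
    return state(conn)
```

With two transactions the interrupted run leaves the switches disagreeing; with one transaction the interrupted run leaves both at their old value, which is the guarantee the application has to ask for.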
On 24-Feb-09, at 1:37 PM, Mattias Pantzare wrote:
> That would be a bug in ZFS or SQLite2.
>
> A snapshot should be an atomic operation. The effect should be the
> same as a power failure in the middle of a transaction, and decent
> databases can cope with that.

In this special case, that is likely so. But Nicolas' point is salutary in general, especially in the increasingly common case of virtual machines whose disk images are on ZFS. Interacting bugs or bad configuration can produce novel failure modes. Quiescing a system with a complex mix of applications and service layers is no simple matter either, as many readers of this list well know... :)

--Toby
On Tue, Feb 24, 2009 at 03:25:53PM -0600, Tim wrote:
> On Tue, Feb 24, 2009 at 2:37 PM, Nicolas Williams
> <Nicolas.Williams at sun.com> wrote:
> > SMF uses SQLite2. Sorry.
>
> I don't quite follow why it wouldn't work for sqlite2 as well...

Because SQLite2 doesn't have that feature.
On Tue, Feb 24, 2009 at 2:37 PM, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> > Hot Backup?
> >
> > # Connect to the database
> > sqlite3 db $dbfile
> > # Lock the database, copy and commit or rollback
> > if {[catch {db transaction immediate {file copy $dbfile ${dbfile}.bak}} res]} {
> >     puts "Backup failed: $res"
> > } else {
> >     puts "Backup succeeded"
> > }
>
> SMF uses SQLite2. Sorry.

I don't quite follow why it wouldn't work for sqlite2 as well...
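The hot-backup idea in the quoted Tcl can be sketched in Python as well. This is only an illustration under stated assumptions: Python's sqlite3 module speaks the SQLite 3 file format, so it cannot open the SQLite2 repository that SMF keeps, and the names hot_backup, demo, repo.db and the props table are made up for the example.

```python
import os
import shutil
import sqlite3
import tempfile


def hot_backup(db_path, backup_path):
    """Copy db_path to backup_path while holding SQLite's write lock."""
    # isolation_level=None: we issue BEGIN/ROLLBACK ourselves.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the reserved lock, so no other connection
        # can start modifying the file while it is being copied.
        conn.execute("BEGIN IMMEDIATE")
        try:
            shutil.copyfile(db_path, backup_path)
        finally:
            # Nothing was changed; ROLLBACK only releases the lock.
            conn.execute("ROLLBACK")
    finally:
        conn.close()


def demo():
    """Create a throwaway database, back it up, and read the backup."""
    workdir = tempfile.mkdtemp()
    db = os.path.join(workdir, "repo.db")  # hypothetical file name
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE props (name TEXT, value TEXT)")
    conn.execute("INSERT INTO props VALUES ('hostname', 'clone-01')")
    conn.commit()
    conn.close()

    hot_backup(db, db + ".bak")

    check = sqlite3.connect(db + ".bak")
    rows = check.execute("SELECT name, value FROM props").fetchall()
    check.close()
    shutil.rmtree(workdir)
    return rows
```

Holding the lock only protects against other SQLite connections; it does nothing about a process that writes to the file by other means.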
On Mon, Feb 23, 2009 at 02:36:07PM -0800, Christopher Mera wrote:
> panic[cpu0]/thread=dacac880: BAD TRAP: type=e (#pf Page fault)
> rp=d9f61850 addr=1048c0d occurred in module "zfs" due to an illegal
> access to a user address

Can you describe what you're doing with your snapshot? Are you zfs send'ing your snapshots to new systems' rpools? Or something else? You're not using dd(1) or anything like that, right?

Nico
--
It's a zfs snapshot that's then sent to a file.

On the new boxes I'm doing a jumpstart install with the SUNWCreq package, and using the finish script to mount an NFS filesystem that contains the *.zfs dump files. zfs receive imports the data, and the boot environment then boots fine.

-----Original Message-----
From: Nicolas Williams [mailto:Nicolas.Williams at sun.com]
Sent: Tuesday, February 24, 2009 5:43 PM
To: Christopher Mera
Cc: Lori.Alt at sun.com; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs streams & data corruption

On Mon, Feb 23, 2009 at 02:36:07PM -0800, Christopher Mera wrote:
> panic[cpu0]/thread=dacac880: BAD TRAP: type=e (#pf Page fault)
> rp=d9f61850 addr=1048c0d occurred in module "zfs" due to an illegal
> access to a user address

Can you describe what you're doing with your snapshot? Are you zfs send'ing your snapshots to new systems' rpools? Or something else? You're not using dd(1) or anything like that, right?

Nico
--
On Tue, Feb 24, 2009 at 03:08:18PM -0800, Christopher Mera wrote:
> It's a zfs snapshot that's then sent to a file..
>
> On the new boxes I'm doing a jumpstart install with the SUNWCreq
> package, and using the finish script to mount an NFS filesystem that
> contains the *.zfs dump files. Zfs receive is actually importing the
> data and the boot environment then boots fine.

It's possible that your zfs send output files are getting corrupted when accessed via NFS. Try ssh.

Also, when does the panic happen? I searched for CRs with parts of that panic string and found none.
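One way to test the NFS theory is to record a digest of each stream file on the machine that ran zfs send and compare it on the client before running zfs receive. A minimal sketch, assuming the digest is carried across by some other channel; the helper names sha256_of and stream_is_intact are hypothetical, not part of any ZFS tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stream_is_intact(path, expected_hex):
    """Compare a file's digest with the one recorded on the sending host."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A mismatch points at the transport or the storage in between rather than at the snapshot itself; a match does not rule out problems on the receiving pool.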
Miles Nordin wrote:
> Hope this helps untangle some FUD. Snapshot backups of databases
> *are* safe, unless the database or application above it is broken in a
> way that makes cord-yanking unsafe too.

Actually Miles, what they were asking for is generally referred to as a checkpoint and they are used by all major databases for backing up files. Performing a checkpoint will perform such tasks as making sure that all transactions recorded in the log but not yet written to the database are written out and that the system is not in the middle of a write when you grab the data.

Dragging the discussion of database recovery into the discussion seems to me to only be increasing the FUD factor.

Regards,
Greg

>>>>> "gp" == Greg Palmer <GregoryLPalmer at Netscape.net> writes:

    gp> Performing a checkpoint will perform such tasks as making sure
    gp> that all transactions recorded in the log but not yet written
    gp> to the database are written out and that the system is not in
    gp> the middle of a write when you grab the data.

great copying of buzzwords out of a glossary, but does it change my claim or not? My claim is:

  that SQLite2 should be equally as tolerant of snapshot backups as it is of cord-yanking.

The special backup features of databases including ``performing a checkpoint'' or whatever, are for systems incapable of snapshots, which is most of them. Snapshots are not writeable, so this ``in the middle of a write'' stuff just does not happen.

    gp> Dragging the discussion of database recovery into the
    gp> discussion seems to me to only be increasing the FUD factor.

except that you need to draw a distinction between recovery from cord-yanking which should be swift and absolutely certain, and recovery from a cpio-style backup done with the database still running which requires some kind of ``consistency scanning'' and may involve ``corruption'' and has every right to simply not work at all.

The FUD I'm talking about, is mostly that people seem to think all kinds of recovery are of the second kind, which is flatly untrue!

Backing up a snapshot of the database should involve the first category of recovery (after restore), the swift and certain kind, EVEN if you do not ``quiesce'' the database or take a ``checkpoint'' or whatever your particular vendor calls it, before taking the snapshot. You are entitled to just snap it, and expect that recovery work swiftly and certainly just as it does if you yank the cord. If your database vendor considers it some major catastrophe to have the cord yanked, requiring special tools, training seminars, buzzwords, and hours of manual checking, then we have a separate problem, but I don't think SQLite2 is in that category!

Of course Toby rightly pointed out this claim does not apply if you take a host snapshot of a virtual disk, inside which a database is running on the VM guest---that implicates several pieces of untrustworthy stacked software. But for snapshotting SQLite2 to clone the currently-running machine I think the claim does apply, no?
Miles Nordin wrote:
> that SQLite2 should be equally as tolerant of snapshot backups as it
> is of cord-yanking.
>
> The special backup features of databases including ``performing a
> checkpoint'' or whatever, are for systems incapable of snapshots,
> which is most of them. Snapshots are not writeable, so this ``in the
> middle of a write'' stuff just does not happen.

This is correct. The general term for these sorts of point-in-time backups is "crash consistent". If the database can be recovered easily (and/or automatically) from pulling the plug (or a kill -9), then a snapshot is an instant backup of that database.

In-flight transactions (ones that have not been committed) at the database level are rolled back. Applications using the database will be confused by this in a recovery scenario, since transactions that were reported as committed are gone when the database comes back. But that's the case any time a database moves "backward" in time.

> Of course Toby rightly pointed out this claim does not apply if you
> take a host snapshot of a virtual disk, inside which a database is
> running on the VM guest---that implicates several pieces of
> untrustworthy stacked software. But for snapshotting SQLite2 to clone
> the currently-running machine I think the claim does apply, no?

Snapshots of a virtual disk are also crash-consistent. If the VM has not committed its transactionally-committed data and is still holding it in volatile memory, that VM is not maintaining its ACID requirements, and that's a bug in either the database or in the OS running on the VM. The snapshot represents the disk state as if the VM were instantly gone. If the VM or the database can't recover from pulling the virtual plug, the snapshot can't help that.

That said, it is a good idea to quiesce the software stack as much as possible to make the recovery from the crash-consistent image as painless as possible. For example, if you take a snapshot of a VM running on an EXT2 filesystem (or unlogged UFS for that matter) the recovery will require an fsck of that filesystem to ensure that the filesystem structure is consistent. Performing a "lockfs" on the filesystem while the snapshot is taken could mitigate that, but that's still out of the scope of the ZFS snapshot.

--Joe
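The crash-consistency claim can be tried out on a small scale. The sketch below is an illustration only: it uses Python's sqlite3 module (SQLite 3, not the SQLite2 that SMF uses), and a plain directory copy stands in for the snapshot, which is not atomic the way a ZFS snapshot is. The function name and file names are made up. It copies the database directory while a transaction is still open, then opens the copy and lists what survived.

```python
import os
import shutil
import sqlite3
import tempfile


def snapshot_mid_transaction():
    """Copy a database directory mid-transaction and read the copy."""
    root = tempfile.mkdtemp()
    live = os.path.join(root, "live")
    snap = os.path.join(root, "snap")
    os.mkdir(live)

    # isolation_level=None: statements autocommit unless we BEGIN.
    conn = sqlite3.connect(os.path.join(live, "repo.db"),
                           isolation_level=None)
    conn.execute("CREATE TABLE t (v TEXT)")
    conn.execute("INSERT INTO t VALUES ('committed')")

    # Open a transaction and leave it in flight.
    conn.execute("BEGIN")
    conn.execute("INSERT INTO t VALUES ('in flight')")

    # Stand-in for the snapshot: copy the database file together with
    # any rollback journal sitting next to it.
    shutil.copytree(live, snap)

    conn.execute("ROLLBACK")
    conn.close()

    # Opening the copy is the "reboot after the cord was pulled".
    restored = sqlite3.connect(os.path.join(snap, "repo.db"))
    rows = [row[0] for row in restored.execute("SELECT v FROM t")]
    restored.close()
    shutil.rmtree(root)
    return rows
```

The copy contains the committed row and not the in-flight one, which is the behaviour described above for a database that tolerates having the plug pulled.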
On 25-Feb-09, at 9:53 AM, Moore, Joe wrote:

> Miles Nordin wrote:
>> that SQLite2 should be equally as tolerant of snapshot backups as
>> it is of cord-yanking.
>>
>> The special backup features of databases including ``performing a
>> checkpoint'' or whatever, are for systems incapable of snapshots,
>> which is most of them. Snapshots are not writeable, so this ``in the
>> middle of a write'' stuff just does not happen.
>
> This is correct. The general term for these sorts of point-in-time
> backups is "crash consistent". If the database can be recovered
> easily (and/or automatically) from pulling the plug (or a kill -9),
> then a snapshot is an instant backup of that database.
>
> In-flight transactions (ones that have not been committed) at the
> database level are rolled back. Applications using the database
> will be confused by this in a recovery scenario, since transactions
> that were reported as committed are gone when the database
> comes back. But that's the case any time a database moves
> "backward" in time.
>
>> Of course Toby rightly pointed out this claim does not apply if you
>> take a host snapshot of a virtual disk, inside which a database is
>> running on the VM guest---that implicates several pieces of
>> untrustworthy stacked software. But for snapshotting SQLite2 to
>> clone the currently-running machine I think the claim does apply, no?
>
> Snapshots of a virtual disk are also crash-consistent. If the VM
> has not committed its transactionally-committed data and is still
> holding it in volatile memory, that VM is not maintaining its ACID
> requirements, and that's a bug in either the database or in the OS
> running on the VM.

Or the virtual machine! I hate to dredge up the recent thread again - but if your virtual machine is not maintaining guest barrier semantics (write ordering) on the underlying host, then your snapshot may contain inconsistencies entirely unexpected to the virtualised transactional/journaled database or filesystem.[1]

I believe this can be reproduced by simply running VirtualBox with default settings (ignore flush), though I have been too busy lately to run tests which could prove this. (Maybe others would be interested in testing as well.)

I infer this explanation from consistency failures in InnoDB and ext3fs that I have seen[2], which would not be expected on bare metal in pull-plug tests. My point is not about VB specifically, but just that in general, the consistency issue - already complex on bare metal - is tangled further as the software stack gets deeper.

--Toby

[1] - The SQLite web site has a good summary of related issues.
http://sqlite.org/atomiccommit.html
[2] http://forums.virtualbox.org/viewtopic.php?t=13661

>>>>> "jm" == Moore, Joe <joe.moore at siemens.com> writes:

    jm> This is correct. The general term for these sorts of
    jm> point-in-time backups is "crash consistent".

phew, thanks, glad I wasn't talking out my ass again.

    jm> In-flight transactions (ones that have not been committed) at
    jm> the database level are rolled back. Applications using the
    jm> database will be confused by this in a recovery scenario,
    jm> since transactions that were reported as committed are gone
    jm> when the database comes back. But that's the case any time a
    jm> database moves "backward" in time.

hm. I thought a database would not return success to the app until it was actually certain the data was on disk with fsync() or whatever, and this is why databases like NVRAMs and slogs. Are you saying it's a common ``optimisation'' for DBMS to worry about write barriers only, not about flushing?

    jm> Snapshots of a virtual disk are also crash-consistent. If the
    jm> VM has not committed its transactionally-committed data and is
    jm> still holding it in volatile memory, that VM is not maintaining
    jm> its ACID requirements, and that's a bug in either the database
    jm> or in the OS running on the VM.

I'm betting mostly ``the OS running inside the VM'' and ``the virtualizer itself''. For the latter, from Toby's thread:

-----8<-----
If desired, the virtual disk images (VDI) can be flushed when the guest issues the IDE FLUSH CACHE command. Normally these requests are ignored for improved performance.
To enable flushing, issue the following command:
 VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0
-----8<-----

Virtualizers are able to take snapshots themselves without help from the host OS, so I would expect at least those to work, and host snapshots to be fixable. VirtualBox has a ``pause'' feature---it could pretend it's received a flush command from the guest, and flush whatever internal virtualizer buffers it has to the host OS when paused.

Also a host snapshot is a little more forgiving than a host cord-yank because the snapshot will capture things applications like VBox have written to files but not fsync()d yet. so it's ok for snapshots but not cord-yanks if VBox never bothers to call fsync(). It's just not okay that VBox might buffer data internally sometimes.

Even if that's all sorted, though, ``the OS running inside the VM''---neither UFS nor ext3 sends these cache flush commands to virtual drives. At least for ext3, the story is pretty long:

 http://lwn.net/Articles/283161/
  So, for those that wish to enable them, barriers apparently are
  turned on by giving "barrier=1" as an option to the mount(8) command,
  either on the command line or in /etc/fstab:
   mount -t ext3 -o barrier=1 <device> <mount point>
 (but, does not help at all if using LVM2 because LVM2 drops the barriers)

ext3 gets away with it because drive write buffers are small enough that it can mostly get away with only flushing the journal, and the journal's written in LBA order, so except when it wraps around there's little incentive for drives to re-order it. But ext3's supposed ability to mostly work ok without barriers depends on assumptions about physical disks---the size of the write cache being <32MB, their reordering sorting algorithm being elevator-like---that probably don't apply to a virtual disk so a Linux guest OS very likely is ``broken'' w.r.t. taking these crash-consistent virtual disk snapshots.

And also a Solaris guest: we've been told UFS+logging expects the write cache to be *off* for correctness. I don't know if UFS is less good at evading the problem than ext3, or if Solaris users are just more conservative. but, with a virtual disk the write cache will always be effectively on no matter what simon-sez flags you pass to that awful 'format' tool. That was never on the bargaining table because there's no other way it can have remotely reasonable performance.

Possibly the ``pause'' command would be a workaround for this because it could let you force a barrier into the write stream yourself (one the guest OS never sent) and then take a snapshot right after the barrier with no writes allowed between barrier and snapshot. If the fake barrier is inserted into the stack right at the guest/VBox boundary, then it should make the overall system behave as well as the guest running on a drive with the write cache disabled. I'm not sure such a barrier is actually implied by VBox ``pause'' but if I were designing the pause feature it would be.
On 25-Feb-09, at 1:08 PM, Miles Nordin wrote:

>>>>>> "jm" == Moore, Joe <joe.moore at siemens.com> writes:
>
>     jm> This is correct. The general term for these sorts of
>     jm> point-in-time backups is "crash consistent".
>
> phew, thanks, glad I wasn't talking out my ass again.
>
>     jm> In-flight transactions (ones that have not been committed) at
>     jm> the database level are rolled back. Applications using the
>     jm> database will be confused by this in a recovery scenario,
>     jm> since transactions that were reported as committed are gone
>     jm> when the database comes back. But that's the case any time a
>     jm> database moves "backward" in time.
>
> hm. I thought a database would not return success to the app until it
> was actually certain the data was on disk with fsync() or whatever,
> and this is why databases like NVRAMs and slogs. Are you saying it's
> a common ``optimisation'' for DBMS to worry about write barriers only,
> not about flushing?

That would break the "Durable" promise of ACID. To be durable, commit must be synchronous to the application, because the application is about to promise something big to the user (e.g. printing APPROVED :)

That said, this RDBMS behaviour is generally configurable. In fact the subtext of the whole thread is "know your configuration" at all layers, whether that is drive, filesystem, virtual machine, RDBMS, ...

>     jm> Snapshots of a virtual disk are also crash-consistent. If the
>     jm> VM has not committed its transactionally-committed data and is
>     jm> still holding it in volatile memory, that VM is not maintaining
>     jm> its ACID requirements, and that's a bug in either the database
>     jm> or in the OS running on the VM.
>
> I'm betting mostly ``the OS running inside the VM'' and ``the
> virtualizer itself''. For the latter, from Toby's thread:
>
> -----8<-----
> If desired, the virtual disk images (VDI) can be flushed when the
> guest issues the IDE FLUSH CACHE command. Normally these requests are
> ignored for improved performance.
> To enable flushing, issue the following command:
>  VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0
> -----8<-----
>
> Virtualizers are able to take snapshots themselves without help from
> the host OS, so I would expect at least those to work, and host
> snapshots to be fixable. VirtualBox has a ``pause'' feature---it
> could pretend it's received a flush command from the guest, and flush
> whatever internal virtualizer buffers it has to the host OS when
> paused.

Indeed.

> Also a host snapshot is a little more forgiving than a host cord-yank
> because the snapshot will capture things applications like VBox have
> written to files but not fsync()d yet. so it's ok for snapshots but
> not cord-yanks if VBox never bothers to call fsync().

Taking good host snapshots may require VB to do that, though.

> It's just not
> okay that VBox might buffer data internally sometimes.
>
> Even if that's all sorted, though, ``the OS running inside the
> VM''---neither UFS nor ext3 sends these cache flush commands to
> virtual drives. At least for ext3, the story is pretty long:
>
>  http://lwn.net/Articles/283161/
>   So, for those that wish to enable them, barriers apparently are
>   turned on by giving "barrier=1" as an option to the mount(8)
>   command, either on the command line or in /etc/fstab:
>    mount -t ext3 -o barrier=1 <device> <mount point>
>  (but, does not help at all if using LVM2 because LVM2 drops the
>  barriers)
>
> ext3 gets away with it because drive write buffers are small enough
> that it can mostly get away with only flushing the journal, and the
> journal's written in LBA order, so except when it wraps around there's
> little incentive for drives to re-order it. But ext3's supposed
> ability to mostly work ok without barriers

Without *working* barriers, you mean? I haven't RTFS but I suspect ext3 needs functioning barriers to maintain "crash consistency".

> depends on assumptions
> about physical disks---the size of the write cache being <32MB, their
> reordering sorting algorithm being elevator-like---that probably don't
> apply to a virtual disk so a Linux guest OS very likely is ``broken''

Yes, the problems I observed indicate to me that with the Ignore Flushes default, VB can't crash and maintain consistency in ext3 or MySQL+InnoDB (and, I'd bet, pretty much *any* transactional system).

> w.r.t. taking these crash-consistent virtual disk snapshots.
>
> And also a Solaris guest: we've been told UFS+logging expects the
> write cache to be *off* for correctness. I don't know if UFS is less
> good at evading the problem than ext3, or if Solaris users are just
> more conservative. but, with a virtual disk the write cache will
> always be effectively on no matter what simon-sez flags you pass to
> that awful 'format' tool. That was never on the bargaining table
> because there's no other way it can have remotely reasonable
> performance.

...which may imply that a Solaris UFS filesystem is just as prone to damage in VB as a Linux one. (Even ZFS, I'd wager.)

> Possibly the ``pause'' command would be a workaround for this because
> it could let you force a barrier into the write stream yourself (one
> the guest OS never sent) and then take a snapshot right after the
> barrier with no writes allowed between barrier and snapshot. If the
> fake barrier is inserted into the stack right at the guest/VBox
> boundary, then it should make the overall system behave as well as the
> guest running on a drive with the write cache disabled. I'm not sure
> such a barrier is actually implied by VBox ``pause'' but if I were
> designing the pause feature it would be.

Totally.
--Toby

>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:

     c> so it's ok for snapshots but not cord-yanks if VBox never
     c> bothers to call fsync().

    tt> Taking good host snapshots may require VB to do that, though.

AIUI the contents of a snapshot on the host will be invariant no matter where VBox places host fsync() calls along the timeline, or if it makes them at all. The host snapshot will not be invariant of when applications running inside the guest call fsync(), because this inner fsync() implicates the buffer cache in the guest OS, possibly flush commands at the guest/VBox driver/virtualdisk boundary, and stdio buffers inside the VBox app.

so...in the sense that, in a hypothetical nonexistent working overall system, a guest app calling fsync() eventually propagates out until finally VBox calls fsync() on the host's kernel, then yeah, observing a lack of fsync()'s coming out of VBox probably means host snapshots won't be crash-consistent. BUT the effect of the fsync() on the host itself is not what's needed for host snapshots (only needed for host cord-yanks). It's all the other stuff that's needed for host snapshots---flushing the buffer cache inside the guest OS, flushing VBox's stdio buffers, u.s.w., that makes a bunch of write()'s spew out just before the fsync() and dams up other write()s inside VBox and the guest OS until after the fsync() comes out.

     c> But ext3's supposed ability to mostly work ok without
     c> barriers

    tt> Without *working* barriers, you mean? I haven't RTFS but I
    tt> suspect ext3 needs functioning barriers to maintain "crash
    tt> consistency".

no, the lwn article says that ext3 is just like Solaris UFS and never issues a cache flush to the drive (except on SLES where Novell made local patches to their kernel). ext3 probably does still use an internal Linux barrier API to stop dangerous kinds of reordering within the Linux buffer cache, but nothing that makes it down to the drive (nor into VBox). so I think even if you turn on the flush-respecting feature in VBox, Linux ext3 and Solaris UFS would both still be necessarily unsafe (according to our theory so far), at least unsafe from: (1) host cord-yanking, (2) host snapshots taken without ``pausing'' the VM.

If you're going to turn on the VBox flush option, maybe it would be worth trying XFS or ext4 or ZFS inside the guest and comparing their corruptability.

For VBox to simulate a real disk with its write cache turned off, and thus work better with UFS and ext3, VBox would need to make sure writes are not re-ordered. For the unpaused-host-snapshot case this should be relatively easy---just make VBox stop using stdio, and call write() exactly once for every disk command the guest issues and call it in the same order the guest passed it. It's not necessary to call fsync() at all, so it should not make things too much slower.

For the host cord-yanking case, I don't think POSIX gives enough to achieve this and still be fast because you'd be expected to call fsync() between each write. What we really want is some flag, ``make sure my writes appear to have been done in order after a crash.'' I don't think there's such a thing as a write barrier in POSIX, only the fsync() flush command?

Maybe it should be a new rule of zvols that they always act this way. It need not slow things down much for the host to arrange that writes not appear to have been reordered: all you have to do is batch them into chunks along the timeline, and make sure all the writes in a chunk commit, or none of them do. It doesn't matter how big the chunks are nor where they start and end. It's sort of a degenerate form of the snapshot case: with the fwrite()-to-write() change above we can already take a clean snapshot without fsync(), so just pretend as though you were taking a snapshot a couple times a minute, and after losing power roll back to the newest one that survived. I'm not sure real snapshots are the right way to implement it, but the idea is with a COW backing store it should be well within reach to provide the illusion writes are never reordered (and thus that your virtual hard disk has its write cache turned off) without adding lots of io/s the way fsync() does.

This still compromises the D in ACID for databases running inside the guest, in the host cord-yank case, but it should stop the corruption.
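The batching idea can be modelled in a few lines. This is a toy model, not a description of how ZFS or VirtualBox behaves: a disk is a dict of block number to contents, writes are grouped into chunks ("epochs" here, a made-up name), and a crash keeps only whole epochs. All function names are hypothetical.

```python
def apply_writes(disk, writes):
    """Apply (block, data) writes in order to a dict modelling a disk."""
    for block, data in writes:
        disk[block] = data
    return disk


def recover(epochs, committed):
    """Disk state after a crash: only the first `committed` epochs survive."""
    disk = {}
    for epoch in epochs[:committed]:
        apply_writes(disk, epoch)
    return disk


def is_prefix_state(disk, epochs):
    """True if `disk` is what some in-order prefix of the writes produces."""
    stream = [write for epoch in epochs for write in epoch]
    return any(apply_writes({}, stream[:n]) == disk
               for n in range(len(stream) + 1))


# Three epochs of guest writes, in the order the guest issued them.
EPOCHS = [
    [(0, "a1"), (1, "b1")],
    [(0, "a2")],
    [(1, "b2"), (2, "c1")],
]
```

Whatever number of whole epochs survives, the recovered disk matches some in-order prefix of the guest's writes, so the guest never sees a later write without the earlier ones. The last epoch may be lost, which is the durability cost mentioned above.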
Miles Nordin wrote:> gp> Performing a checkpoint will perform such tasks as making sure > gp> that all transactions recorded in the log but not yet written > gp> to the database are written out and that the system is not in > gp> the middle of a write when you grab the data. > > great copying of buzzwords out of a glossary,Wasn''t copied from a glossary, I just tried to simplify it enough for you to understand. I apologize if I didn''t accomplish that goal.> but does it change my > claim or not? My claim is: > > that SQLite2 should be equally as tolerant of snapshot backups as it > is of cord-yanking. >You''re missing the point here Miles. The folks weren''t asking for a method to confirm their database was able to perform proper error recovery and confirm it would survive having the cord yanked out of the wall. They were asking for a reliable way to backup their data. The best way to do that is not by snapshotting alone. The process of performing database backups is well understood and supported throughout the industry. Relying on the equivalent of crashing the database to perform backups isn''t how professionals get the job done. There is a reason that database vendor do not suggest you backup their databases by pulling the plug out of the wall or killing the running process. The best way to backup a database is by using a checkpoint. Your comment about checkpoints being for systems where snapshots are not available is not accurate. That is the normal method of backing up databases under Solaris among others. Checkpoints are useful for all systems since they guarantee that the database files are consistent and do not require recovery which doesn''t always work no matter what the glossy brochures tell you. Typically they are used in concert with snapshots. Force the checkpoint, trigger the snapshot and you''re golden. Let''s take a simple case of a transaction which consists of three database updates within a transaction. 
One of those writes succeeds, you take a snapshot and then the two other writes succeed. Everyone concerned with the transaction believes it succeeded but your snapshot does not show that. When the database starts up again, the data it will have in your snapshot indicates the transaction never succeeded and therefore it will roll out the database transaction and you will lose that transaction. Well, it will assuming that all code involved in that recovery works flawlessly. Issuing a checkpoint on the other hand causes the database to complete the transaction including ensuring consistency of the database files before you take your snapshot. NOTE: If you issue a checkpoint and then perform a snapshot you will get consistent data which does not require the database perform recovery. Matter of fact, that''s the best way to do it. Your dismissal of write activity taking place is inaccurate. Snapshots take a picture of the file system at a point in time. They have no knowledge of whether or not one of three writes required for the database to be consistent have completed. (Refer to above example) Data does not hit the disk instantly, it takes some finite amount of time in between when the write command is issued for it to arrive at the disk. Plainly, terminating the writes between when they are issued and before it has completed is possible and a matter of timing. The database on the other hand does understand when the transaction has completed and allows outside processes to take advantage of this knowledge via checkpointing. All real database systems have flaws in the recovery process and so far every database system I''ve seen has had issues at one time or another. If we were in a perfect world it SHOULD work every time but we aren''t in a perfect world. ZFS promises on disk consistency but as we saw in the recent thread about "Unreliable for professional usage" it is possible to have issues. Likewise with database systems. Regards, Greg
>>>>> "gp" == Greg Palmer <GregoryLPalmer at Netscape.net> writes:gp> Relying on the equivalent of crashing the database to perform gp> backups isn''t how professionals get the job done. well, nevertheless, it is, and should be, supported by SQLite2. gp> Let''s take a simple case of a transaction which consists of gp> three database updates within a transaction. One of those gp> writes succeeds, you take a snapshot and then the two other gp> writes succeed. Everyone concerned with the transaction gp> believes it succeeded but your snapshot does not show that. I''m glad you have some rigid procedures that work well for you, but it sounds like you do not understand how DBMS''s actually deal with their backing store. You could close the gap by reviewing the glossary entry for ACID. It''s irrelevant whether the transaction spawns one write or three---the lower parts of the DBMS make updates transactional. As long as writes are not re-ordered or silently discarded, it''s not a hand-waving recovery-from-chaos process. It''s certain. Sometimes writes ARE lost or misordered, or there are bugs in the DBMS or bad RAM or who knows what, so I''m not surprised your vendor has given you hand-waving recovery tools along with a lot of scary disclaimers. Nor am I surprised that they ask you to follow procedures that avoid exposing their bugs. But it''s just plain wrong that the only way to achieve a correct backup is with the vendor''s remedial freezing tools. I don''t understand why you are dwelling on ``everyone concerned believes it succeeded but it''s not in the backup.'''' So what? Obviously the backup has to stop including things at some point. As long as the transaction is either in the backup or not in the backup, the backup is FINE. It''s a BACKUP. It has to stop somewhere. 
You seem to be concerned that a careful forensic scientist could dig into the depths of the backup and find some lingering evidence that a transaction might have once been starting to come into existence. As far as I'm concerned, that transaction is ``not in the backup'' and thus fine. You might also have a look at the, somewhat overcomplicated w.r.t. database-running-snapshot backups, SQLite2 atomic commit URL Toby posted: http://sqlite.org/atomiccommit.html Their experience points out that filesystems tend to do certain somewhat-predictable but surprising things to the data inside files when the cord is pulled, things which taking a snapshot won't do. so, I was a little surprised to read about some of the crash behaviors SQLite had to deal with, but, with slight reservation, I stand by my statement that the database should recover swiftly and certainly when the cord is pulled. But! it looks like recovering from a ``crash-consistent'' snapshot is actually MUCH easier than a pulled cord, at least a pulled cord with some of the filesystems SQLite2 aims to support. gp> [snapshots] have no knowledge of whether or not one of three gp> writes required for the database to be consistent have gp> completed. it depends on what you mean by consistent. In my language, the database is always consistent, after each of those three writes. The DBMS orders the writes carefully to ensure this. Especially in the case of a lightweight DB like SQLite2 this is the main reason you use the database in the first place. gp> Data does not hit the disk instantly, it takes some finite gp> amount of time in between when the write command is issued for gp> it to arrive at the disk. I'm not sure it's critical to my argument, but, snapshots in ZFS have nothing to do with when data ``hits the disk''. gp> ZFS promises on disk consistency but as we saw in the recent gp> thread about "Unreliable for professional usage" it is gp> possible to have issues. Likewise with database systems.
yes, finally we are in agreement! Here is where we disagree: you want to add a bunch of ponderous cargo-cult procedures and dire warnings, like some convoluted way to tell SMF to put SQLite2 into remedial-backup mode before taking a ZFS snapshot to clone a system. I want to fix the bugs in SQLite2, or in whatever is broken, so that it does what it says on the tin. The first step in doing that is to convince people like you that there is *necessarily* a bug if the snapshot is not a working backup. Never mind the fact that your way simply isn't workable with hundreds of these lightweight SQLite/db4/whatever databases all over the system in nameservices and Samba and LDAP and Thunderbird and so on. Workable or not, it's not _necessary_, and installing this confusing and incorrect expectation that it's necessary blocks bugs from getting fixed, and is thus harmful for reliability overall (see _unworkable_ one sentence ago). HTH.
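The claim that a transaction is either in the backup or not in it can be sketched as a small experiment. This is a hedged illustration only: it assumes Python's bundled sqlite3 module (SQLite 3 with its default rollback journal, not SQLite2), uses a plain file copy as a stand-in for a ZFS snapshot, and the names copy_db, read_balances, snapshot_demo and the table acct are hypothetical. A file copy is not atomic across files the way a snapshot is; it is adequate here only because nothing writes while the copy runs. With a transaction this small the changes most likely still sit in SQLite's page cache at copy time, so the sketch shows the shape of the argument rather than stress-testing recovery.

```python
import os
import shutil
import sqlite3
import tempfile


def copy_db(src, dst):
    """Copy the database file plus any rollback journal lying next to it."""
    shutil.copyfile(src, dst)
    if os.path.exists(src + "-journal"):
        shutil.copyfile(src + "-journal", dst + "-journal")


def read_balances(path):
    """Open a copy and read the table.

    On open, SQLite checks for a hot journal and rolls it back if it finds one.
    """
    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT balance FROM acct ORDER BY id").fetchall()
    conn.close()
    return [r[0] for r in rows]


def snapshot_demo(workdir):
    live = os.path.join(workdir, "live.db")
    mid = os.path.join(workdir, "mid.db")
    after = os.path.join(workdir, "after.db")

    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(live, isolation_level=None)
    conn.execute("CREATE TABLE acct (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO acct VALUES (1, 100), (2, 100), (3, 100)")

    conn.execute("BEGIN")
    conn.execute("UPDATE acct SET balance = balance - 30 WHERE id = 1")
    copy_db(live, mid)      # "snapshot" after one of the three writes
    conn.execute("UPDATE acct SET balance = balance + 10 WHERE id = 2")
    conn.execute("UPDATE acct SET balance = balance + 20 WHERE id = 3")
    conn.execute("COMMIT")
    copy_db(live, after)    # "snapshot" after the commit
    conn.close()

    return read_balances(mid), read_balances(after)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(snapshot_demo(d))
```

The mid-transaction copy reads back as the untouched balances and the post-commit copy as all three updates; neither shows a partial transaction. That is the sense of "consistent" used above, and it says nothing about whether a given DBMS has recovery bugs.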
On Wed, Feb 25, 2009 at 07:33:34PM -0500, Miles Nordin wrote: > You might also have a look at the, somewhat overcomplicated > w.r.t. database-running-snapshot backups, SQLite2 atomic commit URL > Toby posted: > > http://sqlite.org/atomiccommit.html That's for SQLite_3_, 3, not 2. Also, we don't know that there's anything wrong with SQLite2 in this case. That's because we don't have enough information. The OP mentioned panics and showed a kernel panic stack trace. That means we should look at things other than SQLite2, or SMF, first. I asked the OP how they are transferring their zfs send images; the OP has not replied. So rather than go into the weeds, I think we need more information from the OP. Enough that someone from the ZFS team could, say, reproduce the problem, or otherwise trace the cause to user error. Nico