A recent increase in email about ZFS and SNDR (the replication component of Availability Suite) has given me reason to post one of my replies.

> Well, now I'm confused! A colleague just pointed me towards your blog
> entry about SNDR and ZFS which, until now, I thought was not a
> supported configuration. So, could you confirm that for me one way
> or the other?

ZFS is supported with SNDR, because SNDR is filesystem agnostic. That said, ZFS is a very different beast than other Solaris filesystems. The two golden rules of ZFS replication are:

1). All volumes in a ZFS storage pool (see the output of zpool status) must be placed in a single SNDR I/O consistency group. ZFS is the first Solaris filesystem that validates consistency at all levels, so all vdevs in a single storage pool must be replicated in a write-order consistent manner, and an I/O consistency group is the means to accomplish this.

2). While SNDR replication is active, do not attempt to zpool import the SNDR secondary volumes, and while the ZFS storage pool is imported on the SNDR secondary node, do not resume replication. This is truly a double-edged sword, as the instance of ZFS running on the SNDR secondary node will see replicated writes from ZFS on the SNDR primary node, consider these unknown checksums some form of data corruption, and panic Solaris. This is the same reason two or more Solaris hosts can't access the same ZFS storage pool in a SAN. There is a slight safety net here, in that zpool import will think that the ZFS storage pool is active on another node. Unfortunately, stopping replication does not change this state, so you will still need to use the -f (force) option anyway, unless the zpool is in the exported state on the SNDR primary node, as the exported state will be replicated to the SNDR secondary node.

> Of course I know that AVS only cares about blocks so, in principle,
> the FS is irrelevant. However, last time I was researching this, I
> found a doc that explained that the lack of support was due to the
> unpredictable nature of zfs background processes (resilver, etc) and
> therefore not being guaranteed of a truly quiesced FS.

ZFS the filesystem is always on-disk consistent, and ZFS does maintain filesystem consistency through coordination between the ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for SNDR, ZFS caches a lot of an application's filesystem data in the ZIL, so the data is in memory, not written to disk, and SNDR does not know this data exists. ZIL flushes to disk can be seconds behind the actual application writes completing, and if SNDR is running asynchronously, these replicated writes to the SNDR secondary can be additional seconds behind the actual application writes. Unlike UFS filesystems and lockfs -f or lockfs -w, there is no 'supported' way to get ZFS to empty the ZIL to disk on demand. So even though one will get both ZFS and application filesystem consistency within the SNDR secondary volume, there can be many seconds' worth of lost data, since SNDR can't replicate what it does not see.
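To make the first golden rule concrete, here is a rough sketch of placing every device of a pool into one SNDR I/O consistency group. The hostnames, device paths, bitmap volumes, and the exact sndradm set syntax shown are illustrative assumptions only; check sndradm(1M) before using anything like this.

    # Suppose the pool 'tank' is built from two vdevs. Enable an SNDR set
    # for each vdev and name the same I/O consistency group ('tank-group')
    # in every set, so the whole pool is replicated write-order consistently:
    sndradm -nE primaryhost   /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 \
                secondaryhost /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 \
                ip async g tank-group
    sndradm -nE primaryhost   /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
                secondaryhost /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
                ip async g tank-group

    # Control operations are then issued against the group as a whole,
    # e.g. dropping every set into logging mode at once:
    sndradm -n -g tank-group -l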
Jim Dunham wrote:
> Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no
> 'supported' way to get ZFS to empty the ZIL to disk on demand. So even
> though one will get both ZFS and application filesystem consistency
> within the SNDR secondary volume, there can be many seconds worth of
> lost data, since SNDR can't replicate what it does not see.

If the application depends on that data then it should be using O_DSYNC; if it isn't, then it is broken. In which case the ZIL is on disk and SNDR should be replicating that too (either because the dataset's ZIL is in the pool, or because the slog device is part of the SNDR consistency group as well).

--
Darren J Moffat
Jim Dunham wrote:
> ZFS the filesystem is always on-disk consistent, and ZFS does maintain
> filesystem consistency through coordination between the ZPL (ZFS POSIX
> Layer) and the ZIL (ZFS Intent Log). Unfortunately for SNDR, ZFS
> caches a lot of an application's filesystem data in the ZIL, so the
> data is in memory, not written to disk, and SNDR does not know this
> data exists. ZIL flushes to disk can be seconds behind the actual
> application writes completing, and if SNDR is running asynchronously,
> these replicated writes to the SNDR secondary can be additional
> seconds behind the actual application writes.
>
> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
> 'supported' way to get ZFS to empty the ZIL to disk on demand.

I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk, I think? (If you don't actually need a snapshot, simply destroy it immediately afterwards.)

--
Andrew
On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:

> Jim Dunham wrote:
>> ZFS the filesystem is always on-disk consistent, and ZFS does
>> maintain filesystem consistency through coordination between the
>> ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately
>> for SNDR, ZFS caches a lot of an application's filesystem data in
>> the ZIL, so the data is in memory, not written to disk, and SNDR
>> does not know this data exists. ZIL flushes to disk can be seconds
>> behind the actual application writes completing, and if SNDR is
>> running asynchronously, these replicated writes to the SNDR
>> secondary can be additional seconds behind the actual application
>> writes.
>>
>> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
>> 'supported' way to get ZFS to empty the ZIL to disk on demand.
>
> I'm wondering if you really meant ZIL here, or ARC?
>
> In either case, creating a snapshot should get both flushed to disk,
> I think?
> (If you don't actually need a snapshot, simply destroy it
> immediately afterwards.)

Not sure if there's another way to trigger a full flush or lockfs, but to make sure you do have all transactions that may not have been flushed from the ARC, you could just unmount the filesystem or export the zpool. With the latter, you then wouldn't have to worry about the "-f" on the import. See the sketch below.

---
.je
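A rough sketch of that export/import flow, assuming a pool named 'tank' and an SNDR consistency group named 'tank-group'; both names and the sndradm options are illustrative, not a tested procedure.

    # On the SNDR primary: exporting the pool flushes outstanding state
    # and marks the pool as not in use; that on-disk state is replicated
    # to the secondary.
    zpool export tank

    # Stop replication (drop the group into logging mode) before the
    # secondary copy is touched:
    sndradm -n -g tank-group -l

    # On the SNDR secondary: because the pool was cleanly exported on
    # the primary, the import should not need the -f (force) option.
    zpool import tank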
Andrew,

> Jim Dunham wrote:
>> ZFS the filesystem is always on-disk consistent, and ZFS does
>> maintain filesystem consistency through coordination between the
>> ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately
>> for SNDR, ZFS caches a lot of an application's filesystem data in
>> the ZIL, so the data is in memory, not written to disk, and SNDR
>> does not know this data exists. ZIL flushes to disk can be seconds
>> behind the actual application writes completing, and if SNDR is
>> running asynchronously, these replicated writes to the SNDR
>> secondary can be additional seconds behind the actual application
>> writes.
>>
>> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
>> 'supported' way to get ZFS to empty the ZIL to disk on demand.
>
> I'm wondering if you really meant ZIL here, or ARC?

It is my understanding that the ZFS intent log (ZIL) satisfies POSIX requirements for synchronous transactions, thus filesystem consistency. The ZFS adaptive replacement cache (ARC) is where uncommitted filesystem data is being cached. So although unwritten filesystem data is allocated from the DMU and retained in the ARC, it is the ZIL which influences filesystem metadata and data consistency on disk.

> In either case, creating a snapshot should get both flushed to disk,
> I think?

No. A ZFS snapshot is a control path, versus data path, operation and (to the best of my understanding, and testing) has no influence over POSIX filesystem consistency. See the discussion here:
http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809

Invoking a ZFS snapshot will assure the ZFS snapshot is consistent on the replicated disk, but not all actively opened files.

A simple test I performed to verify this was to append to a ZFS file (no synchronous filesystem options being set) a series of blocks with a block-order pattern contained within. At some random point in this process, I took a ZFS snapshot and immediately dropped SNDR into logging mode. When importing the ZFS storage pool on the SNDR remote host, I could see the ZFS snapshot just taken, but neither the snapshot version of the file nor the file itself contained all of the data previously written to it.

I then retested, but opened the file with O_DSYNC, and when following the same test steps above, both the snapshot version of the file and the file itself contained all of the data previously written to it.

> (If you don't actually need a snapshot, simply destroy it
> immediately afterwards.)
>
> --
> Andrew

Jim
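For reference, a rough shell reconstruction of the test described above; the pool, dataset, file, consistency-group, and host names are made up for illustration, and the sndradm options should be checked against sndradm(1M).

    # On the SNDR primary: keep appending numbered records to a file in
    # the replicated pool. Plain shell redirection is asynchronous (no
    # O_DSYNC), matching the first variant of the test.
    i=0
    while :; do
        i=$((i + 1))
        printf 'block %08d\n' "$i" >> /tank/fs/testfile
    done &

    # At some arbitrary point, take a snapshot and then immediately drop
    # the pool's SNDR consistency group into logging mode:
    zfs snapshot tank/fs@marker
    sndradm -n -g tank-group -l

    # On the SNDR secondary: import the pool (-f, since the pool was
    # never exported on the primary) and compare what arrived. The
    # snapshot is visible, but the tail of testfile, and of its snapshot
    # copy, may be missing, as reported above.
    zpool import -f tank
    tail /tank/fs/testfile
    tail /tank/fs/.zfs/snapshot/marker/testfile

    # The O_DSYNC variant needs a small program that opens the file with
    # O_DSYNC; shell redirection cannot request it.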
I'd like to correct a few misconceptions about the ZIL here.

On 03/06/09 06:01, Jim Dunham wrote:
> ZFS the filesystem is always on-disk consistent, and ZFS does maintain
> filesystem consistency through coordination between the ZPL (ZFS POSIX
> Layer) and the ZIL (ZFS Intent Log).

Pool and file system consistency is more a function of the DMU & SPA.

> Unfortunately for SNDR, ZFS caches a lot of an application's
> filesystem data in the ZIL, so the data is in memory, not written
> to disk,

ZFS data is actually cached in the ARC. The ZIL code keeps in-memory records of system call transactions in case an fsync() occurs.

> and SNDR does not know this data exists. ZIL flushes to disk can be
> seconds behind the actual application writes completing,

It's the DMU/SPA that handles the transaction group commits (not the ZIL). Currently these occur every 30 seconds, or more frequently on a loaded system.

> and if SNDR is running asynchronously, these replicated writes to the
> SNDR secondary can be additional seconds behind the actual application
> writes.
>
> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
> 'supported' way to get ZFS to empty the ZIL to disk on demand.

The sync(2) system call is implemented differently in ZFS. For UFS it initiates a flush of cached data to disk, but does not wait for completion. This satisfies the POSIX requirement but never seemed right. For ZFS we wait for all transactions to complete and commit to stable storage (including flushing any disk write caches) before returning. So any asynchronous data in the ARC is written.

Alternatively, a lockfs will flush just a file system to stable storage, but in this case just the intent log is written. (Then later, when the txg commits, those intent log records are discarded.)

For some basic info on the ZIL see: http://blogs.sun.com/perrin/entry/the_lumberjack

Neil.
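From the shell, the two flush mechanisms described above look roughly like this; the ZFS mountpoint is an illustrative assumption.

    # sync(2)/sync(1M): on ZFS, returns only after all pending
    # transactions have committed and disk write caches have flushed.
    sync

    # lockfs -f: flushes a single file system; per the note above, on
    # ZFS this writes the intent log rather than forcing a full
    # transaction group commit.
    lockfs -f /tank/fs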
On 03/06/09 08:10, Jim Dunham wrote:
> Andrew,
>
>> Jim Dunham wrote:
>>> ZFS the filesystem is always on-disk consistent, and ZFS does
>>> maintain filesystem consistency through coordination between the ZPL
>>> (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
>>> SNDR, ZFS caches a lot of an application's filesystem data in the
>>> ZIL, so the data is in memory, not written to disk, and SNDR does
>>> not know this data exists. ZIL flushes to disk can be seconds behind
>>> the actual application writes completing, and if SNDR is running
>>> asynchronously, these replicated writes to the SNDR secondary can be
>>> additional seconds behind the actual application writes.
>>>
>>> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
>>> 'supported' way to get ZFS to empty the ZIL to disk on demand.
>>
>> I'm wondering if you really meant ZIL here, or ARC?
>
> It is my understanding that the ZFS intent log (ZIL) satisfies POSIX
> requirements for synchronous transactions,

True.

> thus filesystem consistency.

No. The filesystems in the pool are always consistent with or without the ZIL. The ZIL is not the same as a journal (or the log in UFS).

> The ZFS adaptive replacement cache (ARC) is where uncommitted
> filesystem data is being cached. So although unwritten filesystem data
> is allocated from the DMU and retained in the ARC, it is the ZIL which
> influences filesystem metadata and data consistency on disk.

No. It just ensures the synchronous requests (O_DSYNC, fsync(), etc.) are on stable storage in case a crash/power fail occurs before the dirty ARC is written when the txg commits.

>> In either case, creating a snapshot should get both flushed to disk,
>> I think?
>
> No. A ZFS snapshot is a control path, versus data path, operation and
> (to the best of my understanding, and testing) has no influence over
> POSIX filesystem consistency. See the discussion here:
> http://www.opensolaris.org/jive/click.jspa?searchID=1695691&messageID=124809
>
> Invoking a ZFS snapshot will assure the ZFS snapshot is consistent on
> the replicated disk, but not all actively opened files.
>
> A simple test I performed to verify this was to append to a ZFS file
> (no synchronous filesystem options being set) a series of blocks with
> a block-order pattern contained within. At some random point in this
> process, I took a ZFS snapshot and immediately dropped SNDR into
> logging mode. When importing the ZFS storage pool on the SNDR remote
> host, I could see the ZFS snapshot just taken, but neither the
> snapshot version of the file nor the file itself contained all of the
> data previously written to it.

That seems like a bug in ZFS to me. A snapshot ought to contain all data that has been written (whether synchronous or asynchronous) prior to the snapshot.

> I then retested, but opened the file with O_DSYNC, and when following
> the same test steps above, both the snapshot version of the file and
> the file itself contained all of the data previously written to it.
Nicolas Williams
2009-Mar-06 17:08 UTC
[zfs-discuss] [storage-discuss] ZFS and SNDR..., now I'm confused.
On Fri, Mar 06, 2009 at 10:05:46AM -0700, Neil Perrin wrote:
> On 03/06/09 08:10, Jim Dunham wrote:
>> A simple test I performed to verify this was to append to a ZFS file
>> (no synchronous filesystem options being set) a series of blocks with
>> a block-order pattern contained within. At some random point in this
>> process, I took a ZFS snapshot and immediately dropped SNDR into
>> logging mode. When importing the ZFS storage pool on the SNDR remote
>> host, I could see the ZFS snapshot just taken, but neither the
>> snapshot version of the file nor the file itself contained all of the
>> data previously written to it.
>
> That seems like a bug in ZFS to me. A snapshot ought to contain all
> data that has been written (whether synchronous or asynchronous) prior
> to the snapshot.

Wouldn't one have to quiesce (export) the pool on the primary before importing it on the secondary? Or does SNDR detect suitable checkpoints using, say, ZFS's cache flush commands?
Jonathan Edwards wrote:
>
> On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:
>
>> Jim Dunham wrote:
>>> ZFS the filesystem is always on-disk consistent, and ZFS does
>>> maintain filesystem consistency through coordination between the ZPL
>>> (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately for
>>> SNDR, ZFS caches a lot of an application's filesystem data in the
>>> ZIL, so the data is in memory, not written to disk, and SNDR does
>>> not know this data exists. ZIL flushes to disk can be seconds behind
>>> the actual application writes completing, and if SNDR is running
>>> asynchronously, these replicated writes to the SNDR secondary can be
>>> additional seconds behind the actual application writes.
>>>
>>> Unlike UFS filesystems and lockfs -f or lockfs -w, there is no
>>> 'supported' way to get ZFS to empty the ZIL to disk on demand.
>>
>> I'm wondering if you really meant ZIL here, or ARC?
>>
>> In either case, creating a snapshot should get both flushed to disk,
>> I think?
>> (If you don't actually need a snapshot, simply destroy it immediately
>> afterwards.)
>
> Not sure if there's another way to trigger a full flush or lockfs, but
> to make sure you do have all transactions that may not have been
> flushed from the ARC, you could just unmount the filesystem or export
> the zpool. With the latter, you then wouldn't have to worry about
> the "-f" on the import.

sync(1M)

--
richard
Jim Dunham
2009-Mar-06 20:10 UTC
[zfs-discuss] [storage-discuss] ZFS and SNDR..., now I'm confused.
Nicolas,

> On Fri, Mar 06, 2009 at 10:05:46AM -0700, Neil Perrin wrote:
>> On 03/06/09 08:10, Jim Dunham wrote:
>>> A simple test I performed to verify this was to append to a ZFS file
>>> (no synchronous filesystem options being set) a series of blocks
>>> with a block-order pattern contained within. At some random point in
>>> this process, I took a ZFS snapshot and immediately dropped SNDR
>>> into logging mode. When importing the ZFS storage pool on the SNDR
>>> remote host, I could see the ZFS snapshot just taken, but neither
>>> the snapshot version of the file nor the file itself contained all
>>> of the data previously written to it.
>>
>> That seems like a bug in ZFS to me. A snapshot ought to contain all
>> data that has been written (whether synchronous or asynchronous)
>> prior to the snapshot.
>
> Wouldn't one have to quiesce (export) the pool on the primary before
> importing it on the secondary?

No. ZFS is always on-disk consistent, so as long as SNDR is in logging mode, zpool import will work on the secondary node.

> Or does SNDR detect suitable checkpoints
> using, say, ZFS's cache flush commands?

SNDR is totally volume and filesystem agnostic. It does not know if ZFS, UFS, VxFS, Oracle, Sybase, or some other application is writing to the SNDR primary volume. It also does not know if the volumes being replicated are JBODs, RAID-1, RAID-5, RAID-Z, or RAID-Z2. Very simplistically, SNDR replicates write I/Os made to the SNDR primary volume to the SNDR secondary volume.

Unfortunately, SNDR is nowhere near as simple to use as ZFS's send / receive.

- Jim
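For comparison, the send / receive style of replication mentioned above looks roughly like this; the pool, dataset, and host names are illustrative placeholders.

    # Replicate a point-in-time snapshot of a dataset to another host:
    zfs snapshot tank/fs@rep1
    zfs send tank/fs@rep1 | ssh secondaryhost zfs receive backup/fs

    # Later changes can be sent incrementally against the prior snapshot:
    zfs snapshot tank/fs@rep2
    zfs send -i tank/fs@rep1 tank/fs@rep2 | ssh secondaryhost zfs receive backup/fs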
Nicolas Williams
2009-Mar-06 20:42 UTC
[zfs-discuss] [storage-discuss] ZFS and SNDR..., now I'm confused.
On Fri, Mar 06, 2009 at 03:10:41PM -0500, Jim Dunham wrote:
>> Wouldn't one have to quiesce (export) the pool on the primary before
>> importing it on the secondary?
>
> No. ZFS is always on-disk consistent, so as long as SNDR is in logging
> mode, zpool import will work on the secondary node.

As long as cache flushes are obeyed, but, yes, you're right.
>>>>> "jd" == Jim Dunham <James.Dunham at Sun.COM> writes:jd> It is my understanding that the ZFS intent log (ZIL) satisfies jd> POSIX requirements for synchronous transactions, thus jd> filesystem consistency. maybe ``file consistency'''' would be clearer. When you say filesystem consistency people imagine their pools won''t import, which I think isn''t what you''re talking about. Databases rely on the ZIL to keep their data files internally consistent, and MTA''s to keep their queue directories consistent: ``file consistency'''' meaning the insides of a file must be consistent with the rest of the insides of the same file, and they won''t be without the ZIL. so, for example, in an imaginary better world where virtual machine software didn''t break all kinds of sync and barrier rules and the ZIL were the only issue, then disabling the ZIL on the Host could cause the filesystems of virtual Guests to become inconsistent and refuse to import or need drastic fsck if the Host lost power, or in the SNDR-replicated copy of the Host, but the Host filesystem and its replica would always stay clean and mountable with or without the ZIL. The ZIL is stored on the disk, never in RAM as your earlier message suggested, so it should be replicated along with everything else, shouldn''t it? unless you are using a slog and leave the slog outside replication, but in that case it should be impossible to import the pool on the secondary because importing with missing slogs doesn''t work yet, so I''m not sure what''s happening to you. Are you actually observing violation of POSIX consistency ``suggestions'''' w.r.t. fsync() or O_DSYNC on the secondary? Or are you talking about close-to-open? Files that you close(), wait for the close to return, break replication, and the file does not appear on the secondary? What''s breaking exactly? jd> A simple test I performed to verify this, was to append to a jd> ZFS file (no synchronous filesystem options being set) a jd> series of blocks with a block order pattern contained jd> within. At some random point in this process, I took a ZFS jd> snapshot, immediately dropped SNDR into logging mode. When jd> importing the ZFS storage pool on the SNDR remote host, I jd> could see the ZFS snapshot just taken, but neither the jd> snapshot version of the file, or the file itself contained all jd> of the data previously written to it. that''s a really good test! so SNDR is good for testing, too, it seems. I''m glad you''ve done it. If we''d just listened to the several people speculating, ``just take a snapshot, it ought to imply a lockfs'''' we could be having nasty surprises months from now. I''m also not that upset about the behavior, if it lets one take and destroy snapshots really fast. I could see the opposing argument that all snapshots should commit to disk atomically, though, because you are saying the snapshot _exists_ but doesn''t have in it what it should---maybe in a more ideal world snapshot should either disappear after reboot, or else if it exists contain exactly what it logically should. jd> I then retested, but opened the file with O_DSYNC, and when jd> following the same test steps above, both the snapshot version jd> of the file, and the file itself contained all of the data jd> previously written to it. AIUI, in this test some of the file data may be written to the ZIL. In the former test, the ZIL would not be used at all. 
But the ZIL is just a separate area on the disk that's faster to write to, since with O_DSYNC or fsync() you would like to return to the application in a hurry. ZFS scribbles down the change as quickly as possible in the ZIL on the disk, then rewrites it in a more organized way later.
>>>>> "np" == Neil Perrin <Neil.Perrin at Sun.COM> writes:np> Alternatively, a lockfs will flush just a file system to np> stable storage but in this case just the intent log is np> written. (Then later when the txg commits those intent log np> records are discarded). In your blog it sounded like there''s an in-RAM ZIL through which _everything_ passes, and parts of this in-RAM ZIL are written to the on-disk ZIL as needed. so maybe I was using the word ZIL wrongly in my last post. are you saying, lockfs will divert writes that would normally go straight to the pool, to pass through the on-disk ZIL instead? assuming any separate slog isn''t destroyed while the power''s off, lockfs and sync should get you the same end result after an unclean shutdown, right? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090306/45acdad8/attachment.bin>
On 03/06/09 14:51, Miles Nordin wrote:
>>>>>> "np" == Neil Perrin <Neil.Perrin at Sun.COM> writes:
>
>     np> Alternatively, a lockfs will flush just a file system to
>     np> stable storage but in this case just the intent log is
>     np> written. (Then later when the txg commits those intent log
>     np> records are discarded).
>
> In your blog it sounded like there's an in-RAM ZIL through which
> _everything_ passes, and parts of this in-RAM ZIL are written to the
> on-disk ZIL as needed.

That's correct.

> So maybe I was using the word ZIL wrongly in my last post.

I understood what you meant.

> Are you saying lockfs will divert writes that would normally go
> straight to the pool, to pass through the on-disk ZIL instead?

Not instead, but as well. The ZIL (code) will write immediately to the stable intent logs, then later the data cached in the ARC will be written as part of the pool transaction group (txg). As soon as that happens the intent log blocks can be re-used.

> Assuming any separate slog isn't destroyed while the power's off,
> lockfs and sync should get you the same end result after an unclean
> shutdown, right?

Right.

Neil.