If the system crashes some time after the last commit of a transaction group (TxG), what happens to the file system transactions issued since that commit? (I presume the last TxG commit represents the last consistent on-disk state.)

Does ZFS recover all file system transactions for which it returned success since the last TxG commit? That would imply that the ZIL must flush the log records for each successful file system transaction before it returns to the caller, so that it can replay those transactions.

Blogs on the ZIL state (I hope I read them right) that log records are maintained in memory and flushed to disk only
1) at a synchronous write request (does that mean the in-memory log is freed after that?), and
2) when the TxG is committed (after which the in-memory log is freed).

Thank you for your time.

This message posted from opensolaris.org
kyusun Chang wrote On 05/04/07 19:34,:
> If system crashes some time after last commit of transaction group (TxG), what
> happens to the file system transactions since the last commit of TxG

They are lost, unless they were synchronous (see below).

> (I presume last commit of TxG represents the last on-disk consistency)?

Correct.

> Does ZFS recover all file system transactions which it returned with success
> since the last commit of TxG, which implies that ZIL must flush log records for
> each successful file system transaction before it returns to caller so that it can replay
> the filesystem transactions?

Only synchronous transactions (those forced by O_DSYNC or fsync()) are written to the intent log.

> Blogs on ZIL states (I hope I read it right) that log records are maintained
> in-memory and flushed to disk only when
> 1) at synchronous write request (does that mean they free in-memory
> log after that),

Yes, they are then freed in memory.

> 2) when TxG is committed (and free in-memory log).
>
> Thank you for your time.
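To make the distinction in the reply above concrete, here is a minimal sketch of the three ways an application can issue a write, seen from the POSIX side. It is not ZFS code. The function names (`write_async`, `write_then_fsync`, `write_dsync`) are hypothetical and chosen for illustration; only `os.open`, `os.write`, `os.fsync`, and the `O_DSYNC` flag are real interfaces. On a platform where Python does not expose `O_DSYNC`, the sketch assumes that falling back to `fsync()` is an acceptable stand-in.

```python
import os
import tempfile


def _write_all(fd: int, data: bytes) -> None:
    """Call write(2) until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_async(path: str, data: bytes) -> None:
    """Plain write: success only means the data reached kernel memory.
    Per the reply above, such a write can be lost if the system crashes
    before the next TxG commit."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path: str, data: bytes) -> None:
    """Write, then block in fsync() until the data is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # the synchronous request
    finally:
        os.close(fd)


def write_dsync(path: str, data: bytes) -> None:
    """Open with O_DSYNC so that every write() is itself synchronous."""
    dsync = getattr(os, "O_DSYNC", None)  # not exposed on every platform
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync is not None:
        flags |= dsync
    fd = os.open(path, flags, 0o644)
    try:
        _write_all(fd, data)
        if dsync is None:
            os.fsync(fd)  # assumed fallback when O_DSYNC is unavailable
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_async(os.path.join(tmp, "lazy"), b"may be lost on crash\n")
        write_then_fsync(os.path.join(tmp, "synced"), b"forced by fsync()\n")
        write_dsync(os.path.join(tmp, "dsync"), b"forced by O_DSYNC\n")
        print(sorted(os.listdir(tmp)))
```

All three calls return success to the caller; the difference is only in what has been promised about stable storage by the time they return.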
> > Does ZFS recover all file system transactions which it returned with success
> > since the last commit of TxG, which implies that ZIL must flush log records for
> > each successful file system transaction before it returns to caller so that
> > it can replay the filesystem transactions?
>
> Only synchronous transactions (those forced by O_DSYNC or fsync()) are
> written to the intent log.

Could you help me clarify "writing 'synchronous' transactions to the log"?

Assume a scenario where new subdirectories D1 and D2 (a child of D1) have been created, then new files F1 in D1 and F2 in D2 have been created, and after some writes to F1 and F2, fsync(F1) was issued. Also assume that a file F3 in another part of the file system is being modified. To recover F1, the creation of D1 and D2 must be recovered. It would be painful to find and log the relevant information at the time of fsync() in order to recover them.

Does it mean that

1) ZFS needs to log EVERY (vs. "synchronous") file system transaction for replay (i.e., redo onto the on-disk state of the last TxG commit), since one cannot predict when fsync() will be requested for which file? That is, ZFS logs them all in memory, but flushes only at a synchronous transaction? Does it also mean that ZFS logs user data for every write()?

2) If the log records accumulated up to fsync(F1) (since the last fsync()) are flushed to disk for replay at a subsequent recovery, ZFS recovers the consistent file system state as of the latest fsync(), including all successful file system transactions up to that point that have nothing to do with F1, e.g., those on F3?

Or am I missing something?

I presume that a flush of the log also occurs at every write() to a file opened with O_DSYNC. Otherwise, it should be the same as the fsync() case.

Are there any other synchronization requests that force the in-memory log to disk?

As a side question, does ZFS log atime updates (and does a snapshot do copy-on-write for them)?

Again, thank you for your time.
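The scenario in this message can be written out as the sequence of system calls an application would make. The sketch below is only a reproduction of the question, not an answer to it: the function name `run_scenario` and the file contents are made up for illustration, and files are opened unbuffered so that each `write()` goes straight to the kernel.

```python
import os
import tempfile


def run_scenario(root: str) -> dict:
    """Replay the D1/D2/F1/F2/F3 scenario under `root`; return the paths."""
    d1 = os.path.join(root, "D1")
    d2 = os.path.join(d1, "D2")          # D2 is a child of D1
    f1 = os.path.join(d1, "F1")
    f2 = os.path.join(d2, "F2")
    f3 = os.path.join(root, "F3")        # lives elsewhere in the file system

    os.mkdir(d1)
    os.mkdir(d2)

    # buffering=0 disables Python's userspace buffer, so each write()
    # below is a real write(2) call.
    with open(f1, "wb", buffering=0) as h1, \
         open(f2, "wb", buffering=0) as h2, \
         open(f3, "wb", buffering=0) as h3:
        h1.write(b"data for F1")
        h2.write(b"data for F2")
        h3.write(b"unrelated data for F3")
        os.fsync(h1.fileno())            # the only synchronous request
        # A crash at this point is the case the question asks about:
        # F1 cannot be recovered unless D1 exists after recovery.

    return {"D1": d1, "D2": d2, "F1": f1, "F2": f2, "F3": f3}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in run_scenario(tmp).items():
            print(name, os.path.relpath(path, tmp))
```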
Neil.Perrin at Sun.COM
2007-May-06 00:39 UTC
[zfs-discuss] Re: recovered state after system crash
kyusun Chang wrote:
>>> Does ZFS recover all file system transactions which it returned with success
>>> since the last commit of TxG, which implies that ZIL must flush log records for
>>> each successful file system transaction before it returns to caller so that
>>> it can replay the filesystem transactions?
>>
>> Only synchronous transactions (those forced by O_DSYNC or fsync()) are
>> written to the intent log.
>
> Could you help me to clarify on "writing 'synchronous' transactions
> to the log"?
>
> Assume a scenario where a sequence of new subdirectories
> D1, D2 (as child of D1) have been created, then new files
> F1 in D1 and F2 in D2 have been created, and after some writes
> to F1 and F2, fsync(F1) was issued.
> Also, assume a file F3 in other parts of the file system
> that are being modified.
> To recover F1, creation of D1 and D2 must be recovered.
> It would be painful to find and log the relevant information
> at the time of fsync() to recover them.

The ZIL will write log records to stable storage for all directory creations and the data for F1, but not the data for F2 or F3. See the code in zil_commit_writer() for the exact details:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#938

> It means that
> 1) ZFS needs to log EVERY (vs "synchronous") file system transactions to replay
> (i.e., redo onto the on-disk state of last commit of TxG)
> since one cannot predict when fsync() would be requested for
> which file, i.e., ZFS log them all in-memory, but flushes only at
> synchronous transaction?
> It also means ZFS log user data for every write()?
>
> 2) If the cumulated log records up to fsync(F1) (from
> last fsync()) is flushed to disk for replay at subsequent
> recovery,
> ZFS recovers the consistent file system state at the point
> in time of latest fsync(), including all successful file
> system transaction up to that point that have nothing to do
> with F1, e.g., F3, before crash?
>
> Or, am I missing something?

So the actual code logs everything except writes, setattr, ACLs, and truncates for other files. This has undergone some change over time and may continue to change.

> I presume that flush of log occurs also at every write() of
> file opened with O_DSYNC. Otherwise, it should be same as
> fsync() case.

Correct.

> Are there any other synchronization request that forces
> in-memory log?

There are others: O_RSYNC, O_SYNC, sync(1M).

> As a side question, does ZFS log atime update (and
> does snapshot copy-on-write for it)?

I don't think atime updates are logged as transactions. I'm not sure about snapshots COW-ing.

> Again, thank you for your time.

So what are your concerns here? Correctness or performance?
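The rule stated in this reply (on a synchronous request, flush everything except writes, setattr, ACL changes, and truncates that belong to other files) can be illustrated with a toy model. This is emphatically not the logic of zil_commit_writer(), which the reply says has changed over time; every name below (`ToyIntentLog`, `Record`, `PER_FILE_OPS`, the operation strings) is hypothetical. The model also assumes that a TxG commit makes both the in-memory and the previously flushed records unnecessary, since the changes they describe are by then part of the consistent on-disk state.

```python
from dataclasses import dataclass

# Operations that, in this toy model, are flushed only for the file being
# synced. The list follows the reply above, not the real source code.
PER_FILE_OPS = {"write", "setattr", "acl", "truncate"}


@dataclass(frozen=True)
class Record:
    seq: int      # order in which the operation was issued
    op: str       # e.g. "mkdir", "create", "write"
    target: str   # path the operation applies to


class ToyIntentLog:
    def __init__(self) -> None:
        self.in_memory: list[Record] = []  # logged, not yet on stable storage
        self.on_disk: list[Record] = []    # flushed by a synchronous request
        self._seq = 0

    def log(self, op: str, target: str) -> None:
        """Record every operation in memory when it is performed."""
        self._seq += 1
        self.in_memory.append(Record(self._seq, op, target))

    def commit(self, synced: str) -> None:
        """Model fsync(synced): flush all records except per-file
        operations on other files, then free what was flushed."""
        remaining = []
        for rec in self.in_memory:
            if rec.op in PER_FILE_OPS and rec.target != synced:
                remaining.append(rec)      # stays in memory only
            else:
                self.on_disk.append(rec)
        self.in_memory = remaining

    def txg_commit(self) -> None:
        """Model a TxG commit (assumption: all log records become obsolete)."""
        self.in_memory.clear()
        self.on_disk.clear()

    def replay_after_crash(self) -> list[tuple[str, str]]:
        """Operations recovered after a crash: only what reached disk."""
        return [(r.op, r.target) for r in sorted(self.on_disk, key=lambda r: r.seq)]


if __name__ == "__main__":
    zil = ToyIntentLog()
    zil.log("mkdir", "D1")
    zil.log("mkdir", "D1/D2")
    zil.log("create", "D1/F1")
    zil.log("create", "D1/D2/F2")
    zil.log("write", "D1/F1")
    zil.log("write", "D1/D2/F2")
    zil.log("write", "F3")
    zil.commit("D1/F1")                    # fsync(F1)
    for op, target in zil.replay_after_crash():
        print(op, target)
```

In this model the directory and file creations and the data for F1 are replayed after a crash, while the data written to F2 and F3 is not, which matches the outcome described for the scenario.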
kyusun Chang
2007-May-06 17:14 UTC
[zfs-discuss] Re: Re: recovered state after system crash
> So what are your concerns here? Correctness or
> performance?

It is not about correctness. In journaling file systems, builders seem to pick either "recover all file system transactions up to the point of the crash" or "you may lose the last few transactions", or make it an option (e.g., the VxFS delayed log). Since snapshot-based file systems have an interval of 5+ seconds between points of on-disk consistency, I was wondering how they respond to this issue.

It is not about performance either, since the chosen approach to currency would affect performance accordingly. It is more curiosity about what customers expect and what builders provide regarding the "currency" of recovery, since ZFS has raised the bar of reliability with detection of silent corruption, self-healing, etc. I guess it still remains a matter of acceptance, preference, and tradeoff.