James Andrewartha
2009-Mar-18 05:28 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
Hi all,

Recently there's been discussion [1] in the Linux community about how filesystems should deal with rename(2), particularly in the case of a crash. ext4 was found to truncate files after a crash that had been written with open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is because ext4 uses delayed allocation and may not write the contents to disk immediately, but commits metadata changes quite frequently. So when rename("foo.tmp","foo") is committed to disk, the file has a length of zero, which is only updated later when the data is written to disk. This means that after a crash "foo" is zero-length, and both the new and the old data have been lost, which is undesirable. This doesn't happen with ext3's default settings, because ext3 writes data to disk before metadata (which has its own performance problems; see Firefox 3 and fsync [2]).

Ted Ts'o's (the main author of ext3 and ext4) response is that applications which perform open(), write(), close(), rename() in the expectation that they will get either the old data or the new data, but not no data at all, are broken, and should instead call open(), write(), fsync(), close(), rename(). Most other people are arguing that POSIX says rename(2) is atomic, and while POSIX doesn't specify crash recovery, returning no data at all after a crash is clearly wrong, and excessive use of fsync is overkill and counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for fsync). I've omitted a lot of detail, but I think this is the core of the argument.

Now the question I have is: how does ZFS deal with open(), write(), close(), rename() in the case of a crash? Will it always return the new data or the old data, or will it sometimes return no data? Is returning no data defensible, either under POSIX or common sense? Comments about other filesystems, e.g. UFS, are also welcome. As a counter-point, XFS (written by SGI) is notorious for data loss after a crash, but its authors defend the behaviour as POSIX-compliant.

Note this is purely a technical discussion - I'm not interested in replies saying ?FS is a better filesystem in general, or in GPL vs CDDL licensing.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781?comments=all
    http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
    http://lwn.net/Articles/323169/
    http://mjg59.livejournal.com/108257.html
    http://lwn.net/Articles/323464/
    http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
    http://lwn.net/Articles/323752/ *
    http://lwn.net/Articles/322823/ *
    * are currently subscriber-only, email me for a free link if you'd like to read them
[2] http://lwn.net/Articles/283745/

--
James Andrewartha | Sysadmin
Data Analysis Australia Pty Ltd | STRATEGIC INFORMATION CONSULTANTS
97 Broadway, Nedlands, Western Australia, 6009
PO Box 3258, Broadway Nedlands, WA, 6009
T: +61 8 9386 3304 | F: +61 8 9386 3202 | I: http://www.daa.com.au
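[Editor's note: for concreteness, the calling sequence under discussion looks roughly like the minimal sketch below. The file names and contents are placeholders, not taken from any particular application; the commented-out fsync() is the extra step Ts'o argues applications must add.]

    /* Minimal sketch of the calling sequence under discussion: write a
     * temporary file, then rename() it over the old name.  Without the
     * commented-out fsync(), a filesystem that commits metadata before
     * data may leave "foo" empty after a crash. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *data = "new contents\n";
        int fd = open("foo.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data)) {
            perror("write"); close(fd); return 1;
        }

        /* fsync(fd); */   /* the extra step Ts'o says applications must add */

        if (close(fd) < 0) { perror("close"); return 1; }
        if (rename("foo.tmp", "foo") < 0) { perror("rename"); return 1; }
        return 0;
    }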
Casper.Dik at Sun.COM
2009-Mar-18 09:17 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>Recently there's been discussion [1] in the Linux community about how
>filesystems should deal with rename(2), particularly in the case of a crash.
>ext4 was found to truncate files after a crash, that had been written with
>open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
>because ext4 uses delayed allocation and may not write the contents to disk
>immediately, but commits metadata changes quite frequently. So when
>rename("foo.tmp","foo") is committed to disk, it has a length of zero which
>is later updated when the data is written to disk. This means after a crash,
>"foo" is zero-length, and both the new and the old data has been lost, which
>is undesirable. This doesn't happen when using ext3's default settings
>because ext3 writes data to disk before metadata (which has performance
>problems, see Firefox 3 and fsync[2])

Believing that, somehow, "metadata" is more important than "other data" should have been put to rest with UFS. Yes, it's easier to "fsck" the filesystem when the metadata is correct, and that gets you a valid filesystem, but that doesn't mean that you get a filesystem with valid contents.

>Ted Ts'o's (the main author of ext3 and ext4) response is that applications
>which perform open(),write(),close(),rename() in the expectation that they
>will either get the old data or the new data, but not no data at all, are
>broken, and instead should call open(),write(),fsync(),close(),rename().
>Most other people are arguing that POSIX says rename(2) is atomic, and while
>POSIX doesn't specify crash recovery, returning no data at all after a crash
>is clearly wrong, and excessive use of fsync is overkill and
>counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
>fsync). I've omitted a lot of detail, but I think this is the core of the
>argument.

As long as POSIX believes that systems don't crash, then clearly there is nothing in the standard which would help the argument on either side. It is a "quality of implementation" property. Apparently, Ts'o feels that reordering filesystem operations is fine.

>Now the question I have, is how does ZFS deal with
>open(),write(),close(),rename() in the case of a crash? Will it always
>return the new data or the old data, or will it sometimes return no data? Is
>returning no data defensible, either under POSIX or common sense? Comments
>about other filesystems, eg UFS are also welcome. As a counter-point, XFS
>(written by SGI) is notorious for data-loss after a crash, but its authors
>defend the behaviour as POSIX-compliant.

I didn't know about XFS behaviour on crash.

I don't know exactly how ZFS commits transaction groups; the ZFS authors can tell and I hope they chime in.

The only time POSIX is in question is when the fileserver crashes and whether or not the NFS server keeps its promises. Some typical Linux configuration would break some of those promises.

Casper
Joerg Schilling
2009-Mar-18 10:08 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
James Andrewartha <jamesa at daa.com.au> wrote:

> Recently there's been discussion [1] in the Linux community about how
> filesystems should deal with rename(2), particularly in the case of a crash.
> ext4 was found to truncate files after a crash, that had been written with
> open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> because ext4 uses delayed allocation and may not write the contents to disk
> immediately, but commits metadata changes quite frequently. So when
> rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> is later updated when the data is written to disk. This means after a crash,
> "foo" is zero-length, and both the new and the old data has been lost, which
> is undesirable. This doesn't happen when using ext3's default settings
> because ext3 writes data to disk before metadata (which has performance
> problems, see Firefox 3 and fsync[2])
>
> Ted Ts'o's (the main author of ext3 and ext4) response is that applications
> which perform open(),write(),close(),rename() in the expectation that they
> will either get the old data or the new data, but not no data at all, are
> broken, and instead should call open(),write(),fsync(),close(),rename().
> Most other people are arguing that POSIX says rename(2) is atomic, and while
> POSIX doesn't specify crash recovery, returning no data at all after a crash
> is clearly wrong, and excessive use of fsync is overkill and
> counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
> fsync). I've omitted a lot of detail, but I think this is the core of the
> argument.

The problem in this case is not whether rename() is atomic but whether the file that replaces the old file in an atomic rename() operation is in a stable state on the disk before calling rename().

The calling sequence of the failing code was:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    write(f, "dat", size);
    close(f);
    rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk is to call:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    write(f, "dat", size);
    fsync(f);
    close(f);

Do not forget to check error codes.....

If the application would call:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (write(f, "dat", size) != size)
        fail();
    if (fsync(f) < 0)
        fail();
    if (close(f) < 0)
        fail();
    if (rename("new", "old") < 0)
        fail();

and if after a crash there is neither the old file nor the new file on the disk in a consistent state, then you may blame the file system.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling wrote:

> James Andrewartha <jamesa at daa.com.au> wrote:
> > Recently there's been discussion [1] in the Linux community about how
> > filesystems should deal with rename(2), particularly in the case of a crash.
> > ext4 was found to truncate files after a crash, that had been written with
> > open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> > because ext4 uses delayed allocation and may not write the contents to disk
> > immediately, but commits metadata changes quite frequently. So when
> > rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> > is later updated when the data is written to disk. This means after a crash,
> > "foo" is zero-length, and both the new and the old data has been lost, which
> > is undesirable. This doesn't happen when using ext3's default settings
> > because ext3 writes data to disk before metadata (which has performance
> > problems, see Firefox 3 and fsync[2])
> >
> > Ted Ts'o's (the main author of ext3 and ext4) response is that applications
> > which perform open(),write(),close(),rename() in the expectation that they
> > will either get the old data or the new data, but not no data at all, are
> > broken, and instead should call open(),write(),fsync(),close(),rename().
>
> The only guaranteed way to have the file "new" in a stable state on the disk
> is to call:
>
>     f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>     write(f, "dat", size);
>     fsync(f);
>     close(f);

AFAIUI, the ZFS transaction group maintains write ordering, at least as far as write()s to the file would be in the ZIL ahead of the rename() metadata updates.

So I think the atomicity is maintained without requiring the application to call fsync() before closing the file. If the TXG is applied and the rename() is included, then the file writes have been too, so foo would have the new contents. If the TXG containing the rename() isn't complete and on the ZIL device at crash time, foo would have the old contents.

POSIX doesn't require the OS to sync() the file contents on close for local files like it does for NFS access? How odd.

--Joe
Casper.Dik at Sun.COM
2009-Mar-18 15:24 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>AFAIUI, the ZFS transaction group maintains write ordering, at least as far
>as write()s to the file would be in the ZIL ahead of the rename() metadata
>updates.
>
>So I think the atomicity is maintained without requiring the application to
>call fsync() before closing the file. If the TXG is applied and the rename()
>is included, then the file writes have been too, so foo would have the new
>contents. If the TXG containing the rename() isn't complete and on the ZIL
>device at crash time, foo would have the old contents.
>
>POSIX doesn't require the OS to sync() the file contents on close for local
>files like it does for NFS access? How odd.

Perhaps sync() but not fsync(). But I'm not sure that that is the case. UFS does that: it schedules writing the modified content when the file is closed, but only on the last close.

Casper
David Dyer-Bennet
2009-Mar-18 16:42 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, March 18, 2009 05:08, Joerg Schilling wrote:

> The problem in this case is not whether rename() is atomic but whether the
> file that replaces the old file in an atomic rename() operation is in a
> stable state on the disk before calling rename().

Good, I was hoping somebody saw it that way.

People tend to assume that a successful close() guarantees the data written to that file is on disk, and I don't believe that is actually promised by POSIX (though I'm by no means a POSIX rules lawyer) or most other modern systems.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Bob Friesenhahn
2009-Mar-18 16:43 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, 18 Mar 2009, Joerg Schilling wrote:

> The problem in this case is not whether rename() is atomic but whether the
> file that replaces the old file in an atomic rename() operation is in a
> stable state on the disk before calling rename().

This topic is quite disturbing to me ...

> The calling sequence of the failing code was:
>
>     f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>     write(f, "dat", size);
>     close(f);
>     rename("new", "old");
>
> The only guaranteed way to have the file "new" in a stable state on the disk
> is to call:
>
>     f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>     write(f, "dat", size);
>     fsync(f);
>     close(f);

But the problem is not that the file "new" is in an unstable state. The problem is that it seems that some filesystems are not preserving the ordering of requests. Failing to preserve the ordering of requests is fraught with peril.

POSIX does not care about "disks" or "filesystems". The only correct behavior is for operations to be applied in the order that they are requested of the operating system. This is a core function of any operating system. It is therefore ok for some (or all) of the data which was written to "new" to be lost, or for the rename operation to be lost, but it is not ok for the rename to end up with a corrupted file with the new name.

In summary, I don't agree with you that the misbehavior is correct, but I do agree that copious expensive fsync()s should be assured to work around the problem.

As it happens, current versions of my own application should be safe from this Linux filesystem bug, but older versions are not. There is even a way to request fsync() on every file close, but that could be quite expensive so it is not the default.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Elling
2009-Mar-18 16:59 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
Bob Friesenhahn wrote:

> As it happens, current versions of my own application should be safe
> from this Linux filesystem bug, but older versions are not. There is
> even a way to request fsync() on every file close, but that could be
> quite expensive so it is not the default.

Pragmatically, it is much easier to change the file system once, than to test or change the zillions of applications that might be broken.
 -- richard
Nicolas Williams
2009-Mar-18 17:25 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, Mar 18, 2009 at 11:15:48AM -0400, Moore, Joe wrote:

> POSIX doesn't require the OS to sync() the file contents on close for
> local files like it does for NFS access? How odd.

Why should it? If POSIX is agnostic as to system crashes / power failures, then why should it say anything about when data should hit the disk in the absence of explicit sync()/fsync() calls?

NFS is a different beast though. Client cache coherency and other issues come up. So to maintain POSIX semantics a number of NFS operations must be synchronous and close() on the client requires flushing dirty buffers to the server.

Nico
--
Nicolas Williams
2009-Mar-18 17:29 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:

> In summary, I don't agree with you that the misbehavior is correct,
> but I do agree that copious expensive fsync()s should be assured to
> work around the problem.

fsync() is, indeed, expensive. Lots of calls to fsync() that are not necessary for correct application operation EXCEPT as a workaround for lame filesystem re-ordering are a sure way to kill performance.

I'd rather the filesystems were fixed than end up with sync;sync;sync; type folklore. Or just don't use lame filesystems.

> As it happens, current versions of my own application should be safe
> from this Linux filesystem bug, but older versions are not. There is
> even a way to request fsync() on every file close, but that could be
> quite expensive so it is not the default.

So now you pepper your apps with an option to fsync() on close()? Ouch.

Nico
--
Bob Friesenhahn
2009-Mar-18 18:26 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, 18 Mar 2009, Richard Elling wrote:

> Bob Friesenhahn wrote:
>> As it happens, current versions of my own application should be safe from
>> this Linux filesystem bug, but older versions are not. There is even a way
>> to request fsync() on every file close, but that could be quite expensive
>> so it is not the default.
>
> Pragmatically, it is much easier to change the file system once, than
> to test or change the zillions of applications that might be broken.

Yes, and particularly because fsync() can be very expensive. At one time fsync() was the same as sync() for ZFS. Presumably it is improved by now.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Miles Nordin
2009-Mar-18 18:27 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>>>>> "ja" == James Andrewartha <jamesa at daa.com.au> writes:ja> other people are arguing that POSIX says rename(2) is atomic, Their statement is true but it''s NOT an argument against T''so who is 100% right: the applications using that calling sequence for crash consistency are not portable under POSIX. atomic has nothing to do with crash consistency. It''s about the view of the filesystem by other processes on the same system, ex., the security vulnerabilities one can have with setuid binaries that work in /tmp if said binaries don''t take advantage of certain guarantees of atomicity to avoid race conditions. Obviously /tmp has zero to do with what the filesystem looks like after a crash: it always looks _empty_. For ext4 the argument is settled, fix the app. But a more productive way to approach the problem would be to look at tradeoffs between performance and crash consistency. Maybe we need fbarrier() (which could return faster---it sounds like on ZFS it could be a noop) instead of fsync(), or maybe something more, something genuinely post-Unix like limited filesystem-transactions that can open, commit, rollback. It''s hard for a generation that grew up under POSIX to think outside it. A hypothetical new API ought to help balance performance/consistency for networked filesystems, too, like NFS or Lustre/OCFS/... For example, networked filesystems often promise close-to-open consistency, and the promise doesn''t necessarily have to do with crashing. It means, client A client B write close sendmsg --------> poll open read (will see all A''s writes) client A client B write wait a while sendmsg -------> poll read (all bets are off) This could stand obvious improvements in two ways. First, if I''m trying to send data to B using the filesystem (monkey chorus: don''t do that! it won''t work! you have to send data between nodes with libgnetdatasender and it''s associated avahi-using setuid-nobody daemon! just check it out of svn. no it doesn''t support IPv6 but the NEXT VERSION, what, 1000 nodes? well then you definitely don''t want to--- DOWN, monkeychorus! If I feel like writing in Python or Javurscript or even PHP, let me. If I feel like sending data through a filesystem, find a way to let me! why the hell not do it? I said post-POSIX.) send USING THE FILESYSTEM, then maybe I don''t want to close the file all the time because that''s slow or just annoying. Is there some dance I can do using locks on B or A to say, ``I need B to see the data, but I do not necessarily need, nor want to wait, for it to be committed to disk---I just want it consistent on all clients''''? like, suppose I keep the file open on A and B at the same time over NFS. Will taking a write lock on A and a read lock on B actually flush the client''s cache and get the information moved from A to B faster? Second, we''ve discussed before NFSv3 write-write-write-commit batching doesn''t work across close/open, so people need slogs to make their servers fast for the task of writing thousands of tiny files while for mounting VM disk images over NFS the slog might not be so badly needed. Even with the slog, the tiny-files scenario would be slowed down by network roundtrips. If we had a transaction API, we could open a transaction, write 1000 files, then close it. On a high-rtt network this could be many orders of magnitude faster than what we have now. 
But it's hard to imagine a transactional API that doesn't break the good things about POSIX-style like ``relatively simple'', ``apparently-stateless NFS client-server sessions'', ``advisory locking only'', ...
Miles Nordin
2009-Mar-18 19:01 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>>>>> "c" == Miles Nordin <carton at Ivy.NET> writes:c> fbarrier() on second thought that couldn''t help this problem. The goal is to associate writing to the directory (rename) with writing to the file referenced by that inode/handle (write/fsync/``fbarrier''''), and in POSIX these two things are pretty distant and unrelated to each other. The posix way to associate these two things is to wait for fsync() to return before asking for the rename. The waiting is expressive---it''s an extremely simple, easy-to-understand API for associating one thing with another. I thought maybe this was so simple there was only one thing not two, so the wait coudl be skipped, but I am wrong. It is too bad because as others have said it means these fsync()''s will have to go in to make the app correct/portable with the API we have to work under, even though ZFS has certain convenient quirks and probably doesn''t need them. IMHO the best reaction to the KDE hysteria would be to make sure SQLite and BerkeleyDB are fast as possible and effortlessly correct on ZFS, and anything that''s slow because of too much synchronous writing to tiny files should use a library instead. This is not currently the case because for high performance one has to manually match DB and ZFS record sizes which isn''t practical for these tiny throwaway databases that must share a filesystem with nonDB stuff, and there might be room for improvement in terms of online defragmentation too. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090318/ee64dda5/attachment.bin>
Casper.Dik at Sun.COM
2009-Mar-18 19:35 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:
>> In summary, I don't agree with you that the misbehavior is correct,
>> but I do agree that copious expensive fsync()s should be assured to
>> work around the problem.
>
>fsync() is, indeed, expensive. Lots of calls to fsync() that are not
>necessary for correct application operation EXCEPT as a workaround for
>lame filesystem re-ordering are a sure way to kill performance.
>
>I'd rather the filesystems were fixed than end up with sync;sync;sync;
>type folklore. Or just don't use lame filesystems.
>
>> As it happens, current versions of my own application should be safe
>> from this Linux filesystem bug, but older versions are not. There is
>> even a way to request fsync() on every file close, but that could be
>> quite expensive so it is not the default.
>
>So now you pepper your apps with an option to fsync() on close()? Ouch.

fsync() was always a wart. Many of the Unix filesystem writers didn't think it was a problem, but it still is. This is now part of the folklore: "you must fsync".

But why do filesystem writers insist that the filesystem can reorder all operations? And why do they believe that "meta data" is more important? Clearly, that is false: how else can you rename files which the system hasn't written already?

I noticed that our old ufs code issued two synchronous writes when creating a file. Unfortunately, it should have used three even when we don't care what's in the file.

Casper
David Dyer-Bennet
2009-Mar-18 20:21 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, March 18, 2009 11:43, Bob Friesenhahn wrote:

> On Wed, 18 Mar 2009, Joerg Schilling wrote:
>>
>> The problem in this case is not whether rename() is atomic but whether
>> the file that replaces the old file in an atomic rename() operation is
>> in a stable state on the disk before calling rename().
>
> This topic is quite disturbing to me ...
>
>> The calling sequence of the failing code was:
>>
>>     f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>>     write(f, "dat", size);
>>     close(f);
>>     rename("new", "old");
>>
>> The only guaranteed way to have the file "new" in a stable state on the
>> disk is to call:
>>
>>     f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
>>     write(f, "dat", size);
>>     fsync(f);
>>     close(f);
>
> But the problem is not that the file "new" is in an unstable state.
> The problem is that it seems that some filesystems are not preserving
> the ordering of requests. Failing to preserve the ordering of
> requests is fraught with peril.

Only in very limited cases. For example, writing the blocks of a file can occur in any order, so long as no block is written twice and so long as no reads are performed. It simply doesn't matter what order that goes to disk in. As soon as somebody reads one of the blocks written, then some of the ordering becomes important.

You're trying, I think, to argue from first principles; may I suggest that a lot is known about filesystem (and database) semantics, and that we will get further if we work within what's already known about that, rather than trying to reinvent the wheel from scratch?

> POSIX does not care about "disks" or "filesystems". The only correct
> behavior is for operations to be applied in the order that they are
> requested of the operating system. This is a core function of any
> operating system.

Is this what it actually says in the POSIX documents? Or in any other filesystem formal definition?

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
David Dyer-Bennet
2009-Mar-18 20:22 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Wed, March 18, 2009 11:59, Richard Elling wrote:

> Bob Friesenhahn wrote:
>> As it happens, current versions of my own application should be safe
>> from this Linux filesystem bug, but older versions are not. There is
>> even a way to request fsync() on every file close, but that could be
>> quite expensive so it is not the default.
>
> Pragmatically, it is much easier to change the file system once, than
> to test or change the zillions of applications that might be broken.

On the other hand, by doing so we've set limits on the behavior of all future applications.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Nicolas Williams
2009-Mar-18 20:57 UTC
[zfs-discuss] SQLite3 on ZFS (Re: rename(2), atomicity, crashes and fsync())
On Wed, Mar 18, 2009 at 03:01:30PM -0400, Miles Nordin wrote:

> IMHO the best reaction to the KDE hysteria would be to make sure
> SQLite and BerkeleyDB are as fast as possible and effortlessly correct on
> ZFS, and anything that's slow because of too much synchronous writing

I tried to do that for SQLite3. I ran into these problems:

1) The max page size for SQLite3 is 16KB. It can be made 32KB, but I got some tests to core dump when I did that. It cannot go beyond that without massive changes to SQLite3. Or maybe the sizes in question were 32KB and 64KB -- either way, smaller than ZFS's preferred block size.

2) The SQLite3 tests depend on the page size being 1KB. So changing SQLite3 to select the underlying filesystem's preferred block size causes spurious test failures.

3) The default SQLite3 cache size becomes a very small 60 or so pages when maxing the page size. I suspect that will mean more pread(2) syscalls; whether that's a problem or not, I'm not sure.

Therefore I held off putting back this change to SQLite3 in the OpenSolaris SFW consolidation.

Nico
--
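[Editor's note: for readers unfamiliar with the knob being discussed, the page size is chosen through SQLite's page_size pragma before the database is populated. The sketch below is only illustrative: the database name and the 8192 value are placeholders, and the maximum accepted page size depends on the SQLite build in use.]

    /* Sketch: ask SQLite for a larger page size before the database is
     * populated, so its pages line up better with the filesystem's
     * preferred block size. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("test.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open: %s\n", sqlite3_errmsg(db));
            return 1;
        }
        /* page_size only takes effect if issued before any content is written */
        if (sqlite3_exec(db,
                         "PRAGMA page_size = 8192;"
                         "CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT);",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "exec: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }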
On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:

> POSIX does not care about "disks" or "filesystems". The only
> correct behavior is for operations to be applied in the order that
> they are requested of the operating system. This is a core function
> of any operating system. It is therefore ok for some (or all) of
> the data which was written to "new" to be lost, or for the rename
> operation to be lost, but it is not ok for the rename to end up with
> a corrupted file with the new name.

Out of curiosity, is this what POSIX actually specifies? If that is the case, wouldn't that mean that the behaviour of ext3/4 is incorrect? (Assuming that it does re-order operations.)
James Litchfield
2009-Mar-19 04:58 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
POSIX has a Synchronized I/O Data (and File) Integrity Completion definition (line 115434 of the Issue 7 (POSIX.1-2008) specification). What it says is that writes for a byte range in a file must complete before any pending reads for that byte range are satisfied. It does not say that if you have 3 pending writes and pending reads for a byte range, the writes must complete in the order issued - simply that they must all complete before any reads complete. See lines 71371-71376 in the write() discussion.

The specification explicitly avoids discussing the "behavior of concurrent writes to a file from multiple processes" and suggests that applications doing this "should use some form of concurrency control."

It is true that because of these semantics, many file system implementations will use locks to ensure that no reads can occur in the entire file while writes are happening, which has the side effect of ensuring the writes are executed in the order they are issued. This is an implementation detail that can be complicated by async IO as well. The only guarantee POSIX offers is that all pending writes to the relevant byte range in the file will be completed before a read to that byte range is allowed. An in-progress read is expected to block any writes to the relevant byte range until the read completes.

The specification also does not say the bits for a file must end up on the disk without an intervening fsync() operation unless you've explicitly asked for data synchronization (O_SYNC, O_DSYNC) when you opened the file. The fsync() discussion (line 31956) says that the bits must undergo a "physical write of data from the buffer cache" that should be completed when the fsync() call returns. If there are errors, the return from the fsync() call should express the fact that one or more errors occurred.

The only guarantee that the physical write happens is if the system supports the _POSIX_SYNCHRONIZED_IO option. If not, the comment is to read the system's conformance documentation (if any) to see what actually does happen. In the case that _POSIX_SYNCHRONIZED_IO is not supported, it's perfectly allowable for fsync() to be a no-op.

Jim Litchfield
-------------------

David Magda wrote:

> On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:
>
>> POSIX does not care about "disks" or "filesystems". The only correct
>> behavior is for operations to be applied in the order that they are
>> requested of the operating system. This is a core function of any
>> operating system. It is therefore ok for some (or all) of the data
>> which was written to "new" to be lost, or for the rename operation to
>> be lost, but it is not ok for the rename to end up with a corrupted
>> file with the new name.
>
> Out of curiosity, is this what POSIX actually specifies? If that is
> the case, wouldn't that mean that the behaviour of ext3/4 is
> incorrect? (Assuming that it does re-order operations.)
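[Editor's note: a program can probe for the _POSIX_SYNCHRONIZED_IO option at run time and request data-integrity completion on every write by opening with O_DSYNC. A minimal sketch follows; the file name and record contents are placeholders.]

    /* Sketch: check whether synchronized I/O is advertised, then open a
     * file with O_DSYNC so each write() only returns once the data has
     * reached stable storage (synchronized I/O data integrity completion). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (sysconf(_SC_SYNCHRONIZED_IO) <= 0)
            fprintf(stderr, "warning: synchronized I/O may not be supported\n");

        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0666);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* This write has data-integrity completion semantics. */
        if (write(fd, "record\n", 7) != 7)
            perror("write");
        close(fd);
        return 0;
    }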
Miles Nordin
2009-Mar-19 05:23 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:dm> is this what POSIX actually specifies? i doubt it. If it did, it would basically mandate a log-structured / COW filesystem, which, although not a _bad_ idea, is way too far from a settled debate to be enshrining in a mandatory ``standard'''' (ex., the database fragmentation problems with LFS, WaFL, ZFS. And the large number of important deployed non-COW filesystems on POSIX systems). There''s no other so-far-demonstrated way than log-structured/COW to achieve this property which some people think they''re entitled to take for granted: ``after a reboot, the system must appear as though it did not reorder any writes. The filesystem must recover to some exact state that it passed through in the minutes leading up to the crash, some state as observed from the POSIX userland (above all write caches).'''' It''s a nice property. Nine years ago when i was trying to get Linux users to try NetBSD, I flogged this as a great virtue of LFS. And if I were designing a non-POSIX operating system to replace Unix, I''d probably promise developers this property. But achieving it is just too constraining to belong in POSIX. If you can find some application that can safely disable some safety feature when it knows it''s running on ZFS that it needs to keep on other filesystems and thus perform absurdly faster on ZFS with no risk, then you can demonstrate the worth of promising this property. The fsync() that i''m sure KDE will add into all their broken apps is such an example, but I doubt it will be ``absurdly faster'''' enough to get ZFS any attention. Maybe something to do with virtual disk backing stores for VM''s? But I don''t think pushing exaggerated expectations as ``obvious'''' in front of people who don''t know the nasty details yet, nor overstating POSIX''s minimal crash requirements, is going to work. There are just too many smart people ready to defend the non-log-stuctured write-in-place filesystems. And I believe it *is* possible to write a correct database or MTA, even with the level of guarantee those systems provide (provide in practice, not provide as specified by POSIX). And the guarantees ARE minimal---just: http://www.google.com/search?q=POSIX+%22crash+consistency%22 and you''ll find even people against T''so''s who want to change ext4 still agree POSIX is on T''so''s side. My own opinion is that the apps are unportable and need to be fixed, and that what te side against T''so wants changed is so poorly stated it''s no more than ad-hoc ``make the apps not broken, because otherwise anything which does the exact same thing as the broken app we just found will also be broken!!!'''' it''s not a clearly articulatable guarantee like that AIUI provided by transaction groups. But linux app developers never seem to give much of a flying shit whether their apps work on notLinux, which is why they think it''s ``practical'''' to change ext4 rather than the nonconformant app, so dragging out the POSIX horse for flogging in support of ``change ext4'''' looks highly hypocritical, while flogging the same horse to support ``ZFS is the only POSIXly correct filesystem on the planet'''' is flatly incorrect but at least not hypocritical. :) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090319/fdae5c27/attachment.bin>
Bob Friesenhahn
2009-Mar-19 15:32 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
On Thu, 19 Mar 2009, Miles Nordin wrote:

> And the guarantees ARE minimal---just:
>
>   http://www.google.com/search?q=POSIX+%22crash+consistency%22
>
> and you'll find even people arguing against Ts'o who want to change ext4
> still agree POSIX is on Ts'o's side.

Clearly I am guilty of inflated expectations. Regardless, POSIX specifications define a "minimum set" of expectations and there is nothing to prevent vendors from offering more, or for enhanced specifications (e.g. Open Group) from raising the bar.

Now that I am more aware of the situation, I can see that users of my software are likely to lose files if the system were to crash. There is an "fsync-safe" mode for my software which should avoid this, but application performance would suffer quite a lot if it was used on a large scale. If ZFS does try to order its disk updates in chronological order without prioritizing metadata updates over data, then the risk is minimized.

While a number of esteemed Sun kernel engineers have expressed their views here, we have yet to hear an opinion/statement from a Sun ZFS development engineer.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Miles Nordin
2009-Mar-19 18:13 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
>>>>> "bf" == Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:bf> If ZFS does try to order its disk updates in cronological bf> order without prioritizing metadata updates over data, then bf> the risk is minimized. AIUI it doesn''t exactly order them, just puts them into 5-second chunks. so it rolls the on-disk representation forward in lurching steps every 5 seconds, and the boundaries between each step are exact representations of how the filesystem once looked to the userland. I do not udnerstand yet if fsync() will lurch forward the _entire_ filesystem, or just the inode being fsync()d. Unless i''m mistaken ``the property,'''' as I described it, can only be achieved by lurching forward the entire filesystem whenever you fsync anything because otherwise you will recover to an overall state through which you never passed before the crash (with the fsync''d file being a little newer), but it might be faster/better to violate the property and only sync what was asked. If it''s the entire filesystem, then it might improve performance to separate unrelated heavy writers into different filesystems---for example it would be better to put a tape-emulation backup directory and a mail queue directory into separate filesystems even fi they have to go in the same pool. If it breaks the property and only sync''s the inode asked, then two directories on one filesystem vs two filesystems should not change performacne of that scenario which is an advantage. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090319/86c6343b/attachment.bin>
Peter Schuller
2009-Mar-19 23:00 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
> fsync() is, indeed, expensive. Lots of calls to fsync() that are not
> necessary for correct application operation EXCEPT as a workaround for
> lame filesystem re-ordering are a sure way to kill performance.

IMO the fundamental problem is that the only way to achieve a write barrier is fsync() (disregarding direct I/O etc). Again, I would just like an fbarrier() as I've mentioned on the list previously. It seems to me that if this were just adopted by some operating systems and applications could start using it, things would just sort themselves out when file systems/block device layers start actually implementing the optimization possible (instead of the naive fbarrier() -> fsync()).

As was noted previously in the previous thread on this topic, ZFS effectively has an implicit fbarrier() in between each write. Imagine now if all the applications out there were automatically massively faster on ZFS... but this won't happen until operating systems start exposing the necessary interface.

What does one need to do to get something happening here? Other than whine on mailing lists...

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
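[Editor's note: fbarrier() is not an existing system call on any shipping OS; the sketch below only illustrates how the pattern from earlier in the thread might look if the proposed call existed, with plain fsync() as the conservative fallback Peter mentions. All names are hypothetical.]

    /* HYPOTHETICAL: fbarrier() does not exist today.  The idea is "order
     * my earlier writes before anything I issue later", without forcing
     * an immediate flush to disk.  The portable stand-in is fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef HAVE_FBARRIER
    #define fbarrier(fd) fsync(fd)   /* conservative fallback */
    #endif

    int main(void)
    {
        const char *data = "new contents\n";
        int fd = open("foo.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data)) {
            perror("write"); close(fd); return 1;
        }
        fbarrier(fd);   /* barrier, not flush: data is ordered before the rename */
        if (close(fd) < 0) { perror("close"); return 1; }
        if (rename("foo.tmp", "foo") < 0) { perror("rename"); return 1; }
        return 0;
    }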
Peter Schuller
2009-Mar-19 23:51 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
Uh, I should probably clarify some things (I was too quick to hit send):

> IMO the fundamental problem is that the only way to achieve a write
> barrier is fsync() (disregarding direct I/O etc). Again, I would just
> like an fbarrier() as I've mentioned on the list previously. It seems

Of course, if fbarrier() is analogous to fsync(), this does not actually address the particular problem which is the main topic of this thread, since there the fbarrier() would presumably apply only to I/O within that file. This particular case would only be helped if the fbarrier() were global, or at least extended further than the particular file.

Fundamentally, I think a useful observation is that the only time you ever care about persistence is when you make a contract with an external party outside of your blackbox of I/O. Typical examples are database commits and mail server queues. Anything within the blackbox is only concerned with consistency.

In this particular case, the fsync()/fbarrier() operate on the blackbox of the file, with the directory being an external party. The rename() operation on the directory entry constitutes an operation which depends on the state of the individual file blackbox, thus constituting an external dependency and thus requiring persistence.

The question is whether it is necessarily a good idea to make the blackbox be the entire file system. If it is, a lot of things would be much much easier. On the other hand, it also makes optimization more difficult in many cases. For example, the latency of persisting 8kb of data could be very very significant if there are large amounts of bulk I/O happening in the same file system. So I definitely see the motivation behind having persistence guarantees be non-global.

Perhaps it boils down to the files+directory model not necessarily being the best one in all cases. Perhaps one would like to define subtrees which have global fsync()/fbarrier() type semantics within each respective subtree. On the other hand, that sounds a lot like a ZFS file system, other than the fact that ZFS file system creation is not something which is exposed to the application programmer.

How about having file-system global barrier/persistence semantics, but having a well-defined API for creating child file systems rooted at any point in a hierarchy? It would allow "global" semantics and what that entails, while allowing that bulk I/O happening in your 1 TB PostgreSQL database to be segregated, in terms of performance impact, from your "kde settings" file system.

> What does one need to do to get something happening here? Other than
> whine on mailing lists...

And that came off much more rude than intended. Clearly it's not an implementation effort issue, since the naive fbarrier() is basically calling fsync(). However, I get the feeling there is little motivation in the operating system community for addressing these concerns, for whatever reason (IIRC it was only recently that some write barrier/write caching issues started being seriously discussed in the Linux kernel community, for example).

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Joerg Schilling
2009-Mar-20 09:17 UTC
[zfs-discuss] rename(2), atomicity, crashes and fsync()
Peter Schuller <peter.schuller at infidyne.com> wrote:

> > fsync() is, indeed, expensive. Lots of calls to fsync() that are not
> > necessary for correct application operation EXCEPT as a workaround for
> > lame filesystem re-ordering are a sure way to kill performance.
>
> IMO the fundamental problem is that the only way to achieve a write
> barrier is fsync() (disregarding direct I/O etc). Again, I would just
> like an fbarrier() as I've mentioned on the list previously. It seems
> to me that if this were just adopted by some operating systems and
> applications could start using it, things would just sort themselves out
> when file systems/block device layers start actually implementing the
> optimization possible (instead of the naive fbarrier() -> fsync()).

In addition, POSIX does not mention that close() needs to sync the file to disk. If an application like star likes to verify whether files could be written to disk in order to create a correct exit code, the only way is to call fsync() before close(). With UFS, this creates a performance impact of approx. 10%; with ZFS this was more than 10% the last time I checked.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily