Hi everybody,

I'm a newbie to ZFS. I have a specific question about the COW transactions of ZFS.

Does ZFS keep sequential consistency when it meets a power outage or server crash?

Assume the following scenario:

My application has only a single thread, and it appends data to a file continuously. Suppose that at time t1 it appends a buffer named A to the file. At time t2, which is later than t1, it appends a buffer named B to the file. If the server crashes after t2, is it possible that buffer B is flushed to disk but buffer A is not?

Does ZFS guarantee that data written to a file in sequential or causal order is flushed to disk in the same order?

If the write operations to a single file are always bound to the same transaction group, I think the answer should be YES.

I hope somebody can help me clarify this. Thank you very much!
--
This message posted from opensolaris.org
Yes. It is my understanding that ZFS (at least recent versions) will detect incomplete transactions and simply roll back to the last consistent uberblock in case of trouble.

I'm not completely up to speed with regard to the ODF, uberblocks and the ZIL. In my recollection, the "inspection / selection" of uberblocks had been in the realm of manual recovery with zdb only, until lately. If I'm not mistaken, an automatic 'regress-to-last-known-good-uberblock' function is new and recent.

I'm not quite sure whether that uberblock-based rollback is used in the context of ZIL transaction recovery, or is intended for the case where the ZIL itself has failed (e.g. ZIL on a ramdisk, or ZIL on a failed vdev with insufficient redundancy). I suspect it is separate and works even without a ZIL.

Note that this of course still means that working without a ZIL, or losing the ZIL in a crash or unexpected shutdown of ZFS, will result in data loss. It just won't (easily) result in a corrupted zpool, because ZFS will always try to find a working uberblock, possibly an older one that lacks the latest changes.

So far my ramblings. I'm sure they contain a few handy pointers on where to look for more solid info.

Seth
Thank you very much for your quick response.

What I want to figure out is whether there has been data loss after a power outage. I have replicas on other machines, so I can recover from the data loss. But I need a way to know whether data was lost without comparing the different replicas.

I suppose that if I append a footer to the end of the file before I close it, I can detect data loss by validating the footer. Is that a workable workaround for me? Or is there a better alternative? In my scenario the file is append-only, with no in-place overwrites.
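The footer idea in the message above could be sketched roughly as follows. This is only an illustration, not something taken from the thread: the footer layout (a magic marker, the payload length and a CRC32), the `MAGIC` value and the function names are all assumptions made for the example.

```python
import os
import struct
import zlib

MAGIC = b"EOF1"                    # hypothetical footer marker
FOOTER = struct.Struct("<4sQI")    # magic, payload length, CRC32 of payload


def write_with_footer(path, chunks):
    """Write all chunks, then a footer describing what was written."""
    length, crc = 0, 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            length += len(chunk)
            crc = zlib.crc32(chunk, crc)   # running CRC over the whole payload
        f.write(FOOTER.pack(MAGIC, length, crc))


def is_complete(path):
    """Return True only if a footer is present and matches the payload."""
    size = os.path.getsize(path)
    if size < FOOTER.size:
        return False
    with open(path, "rb") as f:
        payload = f.read(size - FOOTER.size)
        magic, length, crc = FOOTER.unpack(f.read(FOOTER.size))
    return magic == MAGIC and length == len(payload) and crc == zlib.crc32(payload)
```

A file whose tail was lost fails the check because the last bytes no longer parse as a matching footer. Note that the sketch reads the whole payload into memory, which is fine for a demonstration but probably not for large files.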
On Sat, 5 Dec 2009, Seth Heeren wrote:

> Yes. It is my understanding that (at least recent versions) will detect
> incomplete transactions and simply rollback to the last consistent
> uberblock in case of trouble.
>
> I'm not completely up to speed with regard to the ODF, Uberblocks and
> the ZIL; In my recollection the "inspection / selection" of uberblocks
> had been in realm of manual recovery with zdb only, until lately. If I'm
> not mistaken a automatic 'regress-to-last-known-good-uberblock' function
> is new and recent.

ZFS has always rolled back to the last good state. The manual rollback is to deal with the case where the underlying storage hardware misbehaved and did not persist the data as instructed, but an older transaction group did get persisted OK.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 5-Dec-09, at 8:32 AM, nxyyt wrote:

> Thank you very much for your quick response.
>
> My question is I want to figure out whether there is data loss
> after power outage. I have replicas on other machines so I can
> recovery from the data loss. But I need a way to know whether there
> is data loss without comparing the different data replicas.
>
> I suppose if I append a footer to the end of file before I close
> it, I can detect the data loss by validating the footer. Is it a
> work aroud for me ? Or is there a better alternative? In my
> scenario, the file is append-only, no in-place overwrite.

You seem to be looking for fsync() and/or fdatasync(); or, take advantage of existing systems with durable commits (e.g. [R]DBMS).

--Toby

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
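For a write-heavy appender, one common compromise with the fsync() suggestion above is to sync once per batch of records rather than after every write. The following is a minimal sketch under that assumption; `append_durably`, `_write_all` and the `sync_every` parameter are names invented for the example, and the durability it gives still depends on the storage honouring cache flushes.

```python
import os


def _write_all(fd, data):
    """Call os.write() until every byte of data has been accepted."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_durably(path, records, sync_every=64):
    """Append records, calling fsync() once per batch; return the fsync count."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    syncs = 0
    try:
        pending = 0
        for record in records:
            _write_all(fd, record)
            pending += 1
            if pending >= sync_every:
                os.fsync(fd)      # blocks until the OS reports the data as stable
                syncs += 1
                pending = 0
        if pending:
            os.fsync(fd)          # flush the final, partial batch
            syncs += 1
    finally:
        os.close(fd)
    return syncs
```

With this shape, a crash can lose at most the records written since the last fsync(), so `sync_every` trades throughput against the size of that window.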
If a power failure happens you will lose anything in cache. So you could lose the entire file on power failure if the system is not busy (i.e. ZFS does delay writes, unless you do an fsync before closing the file). I would still like to see a file system option for "sync on close" or even "wait for txg on close".

One of the best methods is to create a temp file, e.g. ".download.filename", and rename it to "filename" when the download (or whatever) is successful. Alternatively, create an extra empty file to say it has completed, e.g. "filename.dn". I prefer the rename trick.
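The rename trick above might look like the sketch below on a POSIX system. The function name and the temp-file naming are made up for the illustration. The explicit fsync() calls are an addition that does not rely on any filesystem-specific ordering guarantee, and opening a directory to fsync it is assumed to work as it does on typical Unix systems.

```python
import os


def publish_atomically(directory, name, data):
    """Write data to a hidden temp file, sync it, then rename it into place."""
    tmp = os.path.join(directory, "." + name + ".tmp")
    final = os.path.join(directory, name)
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer to the kernel
        os.fsync(f.fileno())      # ask the kernel to push it to stable storage
    os.replace(tmp, final)        # atomic rename within one filesystem
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable (POSIX only)
    finally:
        os.close(dir_fd)
    return final
```

A reader that only ever looks for "filename" then sees either no file or a complete one, never a half-written one.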
On Sat, 5 Dec 2009, Damon Atkins wrote:

> If power failure happens you will lose anything in cache. So you
> could lose the entire file on power failure if the system is not
> busy (ie ZFS does delay writes, unless you do a fsync before closing
> the file). I would still like to see a file system option "sync on
> close" or even "wait for txg on close"

A memory-mapped file may still be updated even after its file descriptor has been closed. It may be updated as long as any of its pages remain mapped. File updates due to updated pages are usually lazy unless msync() is used to flush the pages to backing store. How do you propose that this would be handled?

Bob
The "rename trick" may not work here. Even if I rename the file successfully, the file's data may still reside in memory rather than having been flushed to disk. If I have made a mistake here, please correct me. Thank you!

I'll try to find out whether ZFS always binds writes to the same file to the same open transaction group. If so, I guess my assumption would be true. It seems there is only one open transaction group at any time. Can anybody give me a definitive answer?

As for the ZIL, it must be flushed to disk in fsync() order, so I think the last append to the file would be the last transaction log record in the ZIL for this file. The assumption should still hold.

fsync or fdatasync may be too heavyweight for my case because it's a write-intensive workload. I hope that replicating the data to different machines will be a better way to protect it from a power outage.
This question is forwarded from zfs-discuss. I hope a developer can shed some light on it.

I'm a newbie to ZFS. I have a specific question about the COW transactions of ZFS.

Does ZFS keep sequential consistency for a single file when it meets a power outage or server crash?

Assume the following scenario:

My application has only a single thread, and it appends data to a file continuously. Suppose that at time t1 it appends a buffer named A to the file. At time t2, which is later than t1, it appends a buffer named B to the file. If the server crashes after t2, is it possible that buffer B is flushed to disk but buffer A is not?

My application only appends to the file, without truncation or overwrite. Does ZFS guarantee that data written to a file in sequential or causal order is flushed to disk in the same order?

If uncommitted write operations to a single file are always bound to the same open transaction group, and all transaction groups are committed in sequential order, I think the answer should be YES. In other words, [b]is there only one open transaction group at any time, and are transaction groups committed in order for a single pool?[/b]

I hope somebody can help me clarify this. Thank you very much!
On 5-Dec-09, at 9:32 PM, nxyyt wrote:

> The "rename trick" may not work here. Even if I renamed the file
> successfully, the data of the file may still reside in the memory
> instead of flushing back to the disk. If I made any mistake here,
> please correct me. Thank you!
>
> I'll try to find out whether ZFS binding the same file always to
> the same opening transaction group. If so, I guess my assumption
> here would be true. Seems like there is only one opening
> transaction group at anytime. Can anybody give me a definitive
> answer here?
>
> For ZIL, it must be flushed back to disk in the order of fsync().
> So that the last append of the file would happen as the last
> transaction log in ZIL for this file, I think. The assumption
> should still be true.
>
> fsync or fdatasync may be too heavyweight for my case because it's
> a write intensive workload.

That's the point, isn't it? :)

> I hope replicating the data to different machines to protect the
> data from power outage would be better.

This is the Durability referred to in "ACID". This is a very well studied problem. I suggest you look at the literature and architecture surrounding transactional databases if you find that tackling this through a POSIX filesystem is problematic.

--Toby
Anurag Agarwal
2009-Dec-06 17:11 UTC
[zfs-discuss] [zfs-code] Transaction consistency of ZFS
Hi,

My reading of the ZFS write code (zfs_write in zfs_vnops.c) is that all writes in ZFS are logged in the ZIL. If that is indeed the case, then yes, ZFS does guarantee sequential consistency, even across a power outage or server crash. You might lose some writes if the ZIL has not been committed to disk, but that would not change the sequential consistency guarantee.

There is no need to do an fsync or open the file with O_SYNC. It should work as it is.

I have not done any experiments to verify this, so please take my observation with a pinch of salt. Could any ZFS developers verify or refute this?

Regards,
Anurag.

--
Anurag Agarwal
CEO, Founder
KQ Infotech, Pune
www.kqinfotech.com
Coordinator Akshar Bharati
www.aksharbharati.org
Spreading joy through reading
On 12/06/09 10:11, Anurag Agarwal wrote:

> Hi,
>
> My reading of write code of ZFS (zfs_write in zfs_vnops.c), is that all
> the writes in zfs are logged in the ZIL.

Each write gets recorded in memory in case it needs to be forced out later (e.g. by fsync()), but it is not written to the on-disk log until then. If the transaction group that contains the write commits first, the in-memory transaction is discarded instead.

> I'll try to find out whether ZFS binding the same file always to the same
> opening transaction group.

Not sure what you mean by this. Transactions (e.g. writes) go into the current open transaction group (txg). Subsequent writes may enter the same or a future txg. Txgs are committed strictly in order, so writes are not committed out of order. The txg commit is all or nothing, so after a crash you see either all the transactions in that txg or none of them. I think this answers your original question/concern.

> If so, I guess my assumption here would be true.
> Seems like there is only one opening transaction group at anytime.
> Can anybody give me a definitive answer here?

ZFS uses a three-stage transaction model: Open, Quiescing and Syncing. Transactions enter in Open. Quiescing is where a new Open stage has started and the closing txg waits for transactions that have yet to commit to finish. Syncing is where all the completed transactions are pushed to the pool in an atomic manner, with the last write being the root of the new tree of blocks (the uberblock).

All the guarantees assume good hardware. As part of the new uberblock update we flush the write caches of the pool devices. If this is broken, all bets are off.

Neil.
Andrey Kuzmin
2009-Dec-06 19:36 UTC
[zfs-discuss] [zfs-code] Transaction consistency of ZFS
On Sun, Dec 6, 2009 at 8:11 PM, Anurag Agarwal <anurag at kqinfotech.com> wrote:

> Hi,
>
> My reading of write code of ZFS (zfs_write in zfs_vnops.c), is that all the
> writes in zfs are logged in the ZIL. And if that indeed is the case, then

IIRC, there is some upper limit (1MB?) on writes that go to the ZIL, with larger ones executed directly. Yet again, this is an outsider's impression, not the architect's statement.

Regards,
Andrey
Neil,

Thank you. You closed my question. :-)

best regards,
hanzhu
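Neil's description of transaction groups (writes enter the open txg, txgs commit in order, and each commit is all or nothing) can be pictured with a toy model. This is emphatically not ZFS code: the `ToyPool` class and its methods are invented for the illustration, and the quiescing stage is collapsed into a simple queue of closed groups.

```python
class ToyPool:
    """Toy model of the ordering rules described in the thread; not ZFS code."""

    def __init__(self):
        self.open_txg = []   # writes accepted into the currently open group
        self.pending = []    # closed groups waiting to sync, oldest first
        self.on_disk = []    # writes from groups that committed in full

    def write(self, record):
        self.open_txg.append(record)

    def close_txg(self):
        """Stop accepting writes into the current group and open a new one."""
        self.pending.append(self.open_txg)
        self.open_txg = []

    def sync_one(self):
        """Commit the oldest closed group in full (all or nothing)."""
        if self.pending:
            self.on_disk.extend(self.pending.pop(0))

    def crash(self):
        """Lose everything that has not committed; return what survives."""
        self.open_txg, self.pending = [], []
        return list(self.on_disk)


pool = ToyPool()
pool.write("A")
pool.write("B")
pool.close_txg()
pool.write("C")
pool.close_txg()
pool.write("D")
pool.sync_one()          # only the first group reaches disk
survivors = pool.crash()
print(survivors)
```

Whatever point the crash happens at in this model, the surviving writes are always a prefix of the order in which they were issued, which is the property the original question asks about.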
Zhu Han
2009-Dec-07 12:43 UTC
[zfs-discuss] Fwd: [zfs-code] Transaction consistency of ZFS
Answer from another guru...

Assuming you are using synchronous write semantics, the system call to do a write will NEVER return UNTIL the data has been written to stable media (which, in the case of ZFS, might be an SSD-based ZIL, and not the actual backing hard disks). That is the whole point of synchronous writes.

If, however, you are doing async writes, or are never closing the filehandle (essentially doing a streaming write, which it sounds like you are doing), you have no guarantee that the data will make it to stable storage at any given instant (fsync() is required to guarantee a commit; fflush() only hands stdio buffers to the kernel).
For your type of write, however, where you are constantly appending to the same file handle, you can count on previous writes committing before subsequent ones. That is, IF B has made it to stable storage, THEN A will also be there. There is no guarantee that A makes it; it's just that B never makes it without A having done so already.

I'm not 100% sure, but if you have uncommitted writes A (at t1) and B (at t2), both against the same file, and C (at t2) against a different file, there is no guarantee that A commits before C, just that A will commit before or simultaneously with B.

Don't count on there being a single transaction group for a single file. If there are, say, 5 data writes pending on your file, you may see 1-3 committed at once while 4-5 wait (they might be committed together, or separately).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
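If the "B never makes it without A" property holds, an append-only application can frame each record and, after a crash, keep the longest run of intact records from the start of the file. The sketch below is one hypothetical way to do that; the header layout (length plus CRC32) and the names `frame` and `recover` are assumptions for the example, not anything proposed in the thread.

```python
import struct
import zlib

HEADER = struct.Struct("<II")   # payload length, CRC32 of payload


def frame(payload):
    """Prefix a record with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(blob):
    """Return the records in the longest valid prefix of blob."""
    records, pos = [], 0
    while pos + HEADER.size <= len(blob):
        length, crc = HEADER.unpack_from(blob, pos)
        start = pos + HEADER.size
        payload = blob[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break                # torn or missing record: stop here
        records.append(payload)
        pos = start + length
    return records
```

Comparing the number of recovered records with what a replica holds then tells the application how much was lost, without comparing file contents byte by byte.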
Because ZFS is transactional (it effectively preserves order), the rename trick will work. If you find a leftover ".filename", delete it, create a new ".filename", and when you finish writing rename it to "filename". If "filename" exists, you know all writes were completed. If you have a batch system which looks for the file, it will not find it until it is renamed. Not that I am a fan of batch systems which burn CPU polling for a file's existence.