thr3ads.net - zfs discuss - [zfs-discuss] Transaction consistency of ZFS [Dec 2009]


nxyyt

2009-Dec-05 11:47 UTC

[zfs-discuss] Transaction consistency of ZFS

Hi, everybody,

I'm a newbie to ZFS. I have a specific question about the COW
transactions of ZFS.
Does ZFS keep sequential consistency when it meets a power outage or server
crash?

Assume following scenario:

My application has only a single thread and it appends the data to the file
continuously. Suppose at time t1, it appends a buf named A to the file. At time
t2, which is later than t1, it appends a buf named B to the file. If the server
crashes after t2, is it possible that buf B is flushed to the disk but buf A
is not?

Does ZFS guarantee that data written to a file in sequential or causal
order is flushed to disk in the same order? If the write
operations to a single file are always bound to the same transaction group, I
think the answer should be YES.

Hope anybody can help me clarify it. Thank you very much!
-- 
This message posted from opensolaris.org

Seth Heeren

2009-Dec-05 12:39 UTC

[zfs-discuss] Transaction consistency of ZFS

Yes. It is my understanding that ZFS (at least recent versions) will detect
incomplete transactions and simply roll back to the last consistent
uberblock in case of trouble.

I'm not completely up to speed with regard to the ODF, uberblocks and
the ZIL. In my recollection the "inspection / selection" of uberblocks
had been in the realm of manual recovery with zdb only, until lately. If
I'm not mistaken, an automatic 'regress-to-last-known-good-uberblock'
function is new and recent.

I'm not quite sure whether that uberblock-based rollback _is being used
in the context of_ ZIL transaction recovery, or intended in case the ZIL
itself had failed (e.g. ZIL on ramdisk or ZIL on failed vdev with
insufficient redundancy). I suspect it is separate and works even
without a ZIL. Note that of course this still means that working without
a ZIL, or losing the ZIL in a crash or unexpected shutdown of
ZFS, will result in data loss. It just won't (easily) result in a
corrupted zpool because it will try and find a working uberblock at all
times, possibly an older one, lacking the latest changes...

So far my ramblings. I'm sure they contain a few handy pointers on where to
look for more solid info...

Seth


nxyyt

2009-Dec-05 13:32 UTC

[zfs-discuss] Transaction consistency of ZFS

Thank you very much for your quick response.

What I want to figure out is whether there is data loss after a power
outage. I have replicas on other machines so I can recover from the data loss.
But I need a way to know whether there is data loss without comparing the
different data replicas.

I suppose if I append a footer to the end of the file before I close it, I can
detect the data loss by validating the footer. Is that a workaround for me? Or is
there a better alternative? In my scenario, the file is append-only, with no in-place
overwrite.
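
The footer idea can be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: the footer layout (a magic marker, the payload length and a CRC32) and the names `FOOTER_MAGIC`, `write_with_footer` and `is_complete` are hypothetical. The sketch only detects a missing or truncated tail; it does not by itself make the data durable.

```python
import os
import struct
import tempfile
import zlib

FOOTER_MAGIC = b"EOF1"          # hypothetical marker, not a ZFS or POSIX convention
FOOTER_FORMAT = "<QI"           # payload length (8 bytes) + CRC32 (4 bytes)
FOOTER_SIZE = len(FOOTER_MAGIC) + struct.calcsize(FOOTER_FORMAT)


def write_with_footer(path, chunks):
    """Write all chunks, then a footer holding the payload length and CRC32."""
    crc = 0
    length = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            crc = zlib.crc32(chunk, crc)    # running CRC over the whole payload
            length += len(chunk)
        f.write(FOOTER_MAGIC + struct.pack(FOOTER_FORMAT, length, crc))


def is_complete(path):
    """Return True only if a footer is present and matches the payload."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < FOOTER_SIZE:
        return False
    payload, footer = data[:-FOOTER_SIZE], data[-FOOTER_SIZE:]
    if footer[:len(FOOTER_MAGIC)] != FOOTER_MAGIC:
        return False
    length, crc = struct.unpack(FOOTER_FORMAT, footer[len(FOOTER_MAGIC):])
    return length == len(payload) and crc == zlib.crc32(payload)


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.dat")
    write_with_footer(demo, [b"buf A", b"buf B"])
    print("complete:", is_complete(demo))
```

A file whose tail was lost in a crash fails the check because the last bytes no longer form a valid footer, so no comparison against the replicas is needed to notice the loss.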

Bob Friesenhahn

2009-Dec-05 15:46 UTC

[zfs-discuss] Transaction consistency of ZFS

On Sat, 5 Dec 2009, Seth Heeren wrote:
> Yes. It is my understanding that (at least recent versions) will detect
> incomplete transactions and simply rollback to the last consistent
> uberblock in case of trouble.
>
> I'm not completely up to speed with regard to the ODF, Uberblocks and
> the ZIL; In my recollection the "inspection / selection" of uberblocks
> had been in realm of manual recovery with zdb only, until lately. If I'm
> not mistaken an automatic 'regress-to-last-known-good-uberblock' function
> is new and recent.
ZFS has always rolled back to the last good state.  The manual 
rollback is to deal with the case where the underlying storage 
hardware misbehaved and did not persist the data as instructed but an 
older transaction group did get persisted ok.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Toby Thain

2009-Dec-05 21:44 UTC

[zfs-discuss] Transaction consistency of ZFS

On 5-Dec-09, at 8:32 AM, nxyyt wrote:
> Thank you very much for your quick response.
>
> My question is I  want to figure out whether there is data loss  
> after power outage. I have replicas on other machines so I can  
> recovery from the data loss. But I need a way to know whether there  
> is data loss without comparing the different data replicas.
>
> I suppose if I append a footer to the end of file before I close  
> it, I can detect the data loss by validating the footer. Is it a  
> work aroud for me ? Or is there a better alternative? In my  
> scenario, the file is append-only, no in-place overwrite.
You seem to be looking for fsync() and/or fdatasync(); or, take  
advantage of existing systems with durable commits (e.g. [R]DBMS).

--Toby
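
A minimal sketch of the fsync() suggestion for an append-only writer, assuming a POSIX system. The helper name `durable_append` is hypothetical. Each call returns only after the operating system reports the data as flushed, which is exactly the cost a write-intensive workload has to weigh.

```python
import os
import tempfile


def durable_append(fd, buf):
    """Append buf, returning only after the OS reports it flushed to storage."""
    view = memoryview(buf)
    while view:
        written = os.write(fd, view)    # os.write may write fewer bytes than asked
        view = view[written:]
    os.fsync(fd)                        # block until the file's data is flushed


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "append.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        durable_append(fd, b"A")        # once this returns, A should survive a crash
        durable_append(fd, b"B")
    finally:
        os.close(fd)
```

Calling fsync less often (for example once per batch of appends) trades a larger window of possible loss for throughput. Where the platform provides it, `os.fdatasync` can be substituted to skip flushing metadata that is not needed to read the data back.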

Damon Atkins

2009-Dec-06 00:58 UTC

[zfs-discuss] Transaction consistency of ZFS

If a power failure happens you will lose anything in cache. So you could lose the
entire file on power failure if the system is not busy (i.e. ZFS does delay
writes, unless you do an fsync before closing the file). I would still like to
see a file system option "sync on close" or even "wait for txg on close".

Some of the best methods are to create a temp file, e.g.
".download.filename", and rename it to "filename" when the download (or whatever) is
successful. Or create an extra empty file to say it has
been completed, e.g. filename.dn. I prefer the rename trick.
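
The rename trick might look like the sketch below, assuming a POSIX system. The function name `publish_atomically` is hypothetical, and the two fsync calls (on the temp file and on the directory) are an added assumption for file systems that do not preserve ordering on their own; they are not part of the suggestion above.

```python
import os
import tempfile


def publish_atomically(directory, name, data):
    """Write data under a temporary name, then rename it into place."""
    tmp_path = os.path.join(directory, ".download." + name)
    final_path = os.path.join(directory, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()                       # push Python's buffer to the kernel
        os.fsync(f.fileno())            # assumption: force data out before renaming
    os.replace(tmp_path, final_path)    # rename is atomic on POSIX file systems
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                # assumption: also persist the directory entry
    finally:
        os.close(dir_fd)
    return final_path


if __name__ == "__main__":
    target_dir = tempfile.mkdtemp()
    print(publish_atomically(target_dir, "filename", b"payload"))
```

A reader that only ever looks for "filename" sees either nothing or the finished file, never a half-written one.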

Bob Friesenhahn

2009-Dec-06 01:25 UTC

[zfs-discuss] Transaction consistency of ZFS

On Sat, 5 Dec 2009, Damon Atkins wrote:
> If power failure happens you will lose anything in cache. So you 
> could lose the entire file on power failure if the system is not 
> busy (ie ZFS does delay writes, unless you do a fsync before closing 
> the file).  I would still like to see a file system option "sync on 
> close" or even "wait for txg on close"
A memory-mapped file may still be updated even after its file 
descriptor has been closed.  It may be updated as long as any of its 
pages remain mapped.  File updates due to updated pages are usually 
lazy unless msync() is used to flush the pages to backing store.  How 
do you propose that this would be handled?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
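
The memory-mapping point can be illustrated with a small sketch, assuming a Unix system. The helper name `update_via_mapping` is hypothetical. The descriptor is closed before the file is modified, and the change still reaches the file through the mapping; `mmap.flush()` is what asks for the dirty pages to be written back (msync() on Unix).

```python
import mmap
import os
import tempfile


def update_via_mapping(path, payload):
    """Modify a file through a mapping after its descriptor has been closed."""
    fd = os.open(path, os.O_RDWR)
    try:
        mapping = mmap.mmap(fd, 0)          # map the whole (non-empty) file, shared
    finally:
        os.close(fd)                        # descriptor closed, mapping stays valid
    try:
        mapping[:len(payload)] = payload    # dirty a page after close()
        mapping.flush()                     # write the dirty pages back to the file
    finally:
        mapping.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "mapped.bin")
    with open(path, "wb") as f:
        f.write(b"\0" * 16)
    update_via_mapping(path, b"hello")
    with open(path, "rb") as f:
        print(f.read())
```

A "sync on close" option would run at the `os.close(fd)` line, before the update happens, which is the difficulty raised above.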

nxyyt

2009-Dec-06 02:32 UTC

[zfs-discuss] Transaction consistency of ZFS

The "rename trick" may not work here. Even if I renamed the file
successfully, the data of the file may still reside in memory instead of
having been flushed to the disk. If I made any mistake here, please correct me. Thank
you!

I'll try to find out whether ZFS always binds writes to the same file to the
same open transaction group. If so, I guess my assumption here would be true.
It seems like there is only one open transaction group at any time. Can anybody
give me a definitive answer here?

As for the ZIL, it must be flushed to disk in the order of the fsync() calls, so the
last append to the file would be the last transaction log entry in the ZIL for this
file, I think. The assumption should still be true.

fsync or fdatasync may be too heavyweight for my case because it's a
write-intensive workload. I hope replicating the data to different machines to
protect the data from a power outage would be better.

nxyyt

2009-Dec-06 02:42 UTC

[zfs-code] Transaction consistency of ZFS

This question is forwarded from zfs-discuss. I hope a developer can throw
some light on it.

I'm a newbie to ZFS. I have a specific question about the COW
transactions of ZFS.

Does ZFS keep the sequential consistency of the same file when it meets a power
outage or server crash?

Assume the following scenario:

My application has only a single thread and it appends data to the file
continuously. Suppose at time t1, it appends a buf named A to the file. At time
t2, which is later than t1, it appends a buf named B to the file. If the server
crashes after t2, is it possible that buf B is flushed to the disk but buf A
is not?

My application only appends to the file, without truncation or overwrite. Does ZFS
guarantee that data written to a file in sequential or
causal order is flushed to disk in the same order?

If uncommitted write operations to a single file are always bound to the
same open transaction group and all transaction groups are committed in
sequential order, I think the answer should be YES. In other words, is
there only one open transaction group at any time, and are transaction
groups committed in order for a single pool?


Hope anybody can help me clarify it. Thank you very much!

Toby Thain

2009-Dec-06 16:16 UTC

[zfs-discuss] Transaction consistency of ZFS

On 5-Dec-09, at 9:32 PM, nxyyt wrote:
> The "rename trick" may not work here. Even if I renamed the file
> successfully, the data of the file may still reside in the memory  
> instead of flushing back to the disk.  If I made any mistake here,  
> please correct me. Thank you!
>
> I'll try to find out whether ZFS binding the same file always to  
> the same opening transaction group. If so, I guess my assumption  
> here would be true. Seems like there is only one opening  
> transaction group at anytime. Can anybody give me a definitive  
> answer here?
>
> For ZIL, it must be flushed back to disk in the order of fsync().  
> So that the last append of the file would happen as the last  
> transaction log in ZIL for this file, I think. The assumption  
> should still be true.
>
> fsync or fdatasync may be too heavyweight for my case because it's
> a write intensive workload.
That's the point, isn't it? :)
> I hope replicating the data to different machines to protect the  
> data from power outage would be better.
This is the Durability referred to in "ACID". This is a very well
studied problem. I suggest you look at the literature and
architecture surrounding transactional databases if you find that
tackling this through a POSIX filesystem is problematic.

--Toby

Anurag Agarwal

2009-Dec-06 17:11 UTC

[zfs-discuss] [zfs-code] Transaction consistency of ZFS

Hi,

My reading of the ZFS write code (zfs_write in zfs_vnops.c) is that all
writes in ZFS are logged in the ZIL. If that indeed is the case, then
yes, ZFS does guarantee sequential consistency, even when there is a
power outage or server crash. You might lose some writes if the ZIL has not
been committed to disk. But that would not change the sequential consistency
guarantee.

There is no need to do an fsync or open the file with O_SYNC. It should work
as it is.

I have not done any experiments to verify this, so please take my
observation with a pinch of salt.
Could any ZFS developers verify or refute this?

Regards,
Anurag.



-- 
Anurag Agarwal
CEO, Founder
KQ Infotech, Pune
www.kqinfotech.com
9881254401
Coordinator Akshar Bharati
www.aksharbharati.org
Spreading joy through reading

Neil Perrin

2009-Dec-06 18:40 UTC

[zfs-discuss] [zfs-code] Transaction consistency of ZFS

On 12/06/09 10:11, Anurag Agarwal wrote:
> Hi,
> 
> My reading of write code of ZFS (zfs_write in zfs_vnops.c), is that all 
> the writes in zfs are logged in the ZIL.
Each write gets recorded in memory in case it needs to be forced out
later (e.g. by fsync()), but it is not written to the on-disk log until then.
If the transaction group that contains the write commits first,
the in-memory record is simply discarded.


Neil Perrin

2009-Dec-06 19:00 UTC

[zfs-discuss] Transaction consistency of ZFS

> I'll try to find out whether ZFS binding the same file always to the same
> opening transaction group.
Not sure what you mean by this. Transactions (eg writes) will go into
the current open transaction group (txg). Subsequent writes may enter
the same or a future txg. Txgs are obviously committed in order.
So writes are not committed out of order. The txg commit is all or nothing,
so on a crash you get to see all the transactions in that txg or none.
I think this answers your original question/concern.
> If so, I guess my assumption here would be true.
> Seems like there is only one opening transaction group at anytime.
> Can anybody give me a definitive answer here?
ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
Transactions enter in Open. Quiescing is where a new Open stage has
started and waits for transactions that have yet to commit to finish.
Syncing is where all the completed transactions are pushed to the pool
in an atomic manner with the last write being the root of the new tree
of blocks (uberblock).

All the guarantees assume good hardware. As part of the new uberblock update
we flush the write caches of the pool devices. If this is broken all bets
are off.

Neil.
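
The ordering and all-or-nothing behaviour described above can be pictured with a toy model. This is not ZFS code: the class `ToyPool` and its methods are hypothetical, and the model keeps only the ideas that writes enter the open group, that groups commit in order, and that a crash discards whatever has not yet been synced.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy model only: committed state advances one whole txg at a time."""
    committed: list = field(default_factory=list)   # what survives a crash
    open_txg: list = field(default_factory=list)    # accepting new writes
    quiescing: list = field(default_factory=list)   # closed, waiting to sync
    synced_txgs: int = 0

    def write(self, record):
        self.open_txg.append(record)        # writes always enter the open txg

    def rotate(self):
        """Sync any quiesced txg, then close the open txg and start a new one."""
        self.sync()
        self.quiescing, self.open_txg = self.open_txg, []

    def sync(self):
        """All or nothing: the whole quiesced txg becomes durable at once."""
        if self.quiescing:
            self.committed.extend(self.quiescing)
            self.quiescing = []
            self.synced_txgs += 1

    def crash(self):
        """Power loss: everything not yet synced is gone."""
        self.open_txg, self.quiescing = [], []
        return list(self.committed)


if __name__ == "__main__":
    pool = ToyPool()
    pool.write("A")
    pool.rotate()          # A's txg is now quiescing
    pool.write("B")        # B lands in the next open txg
    pool.sync()            # A's txg commits as a unit
    print(pool.crash())    # B is lost, A is kept
```

In this model a crash can leave A without B, or neither, but never B without A, because B can only sit in the same group as A or in a later one.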

Andrey Kuzmin

2009-Dec-06 19:36 UTC

[zfs-discuss] [zfs-code] Transaction consistency of ZFS

On Sun, Dec 6, 2009 at 8:11 PM, Anurag Agarwal <anurag at kqinfotech.com> wrote:
> Hi,
>
> My reading of write code of ZFS (zfs_write in zfs_vnops.c), is that all the
> writes in zfs are logged in the ZIL. And if that indeed is the case, then
IIRC, there is some upper limit (1MB?) on writes that go to the ZIL, with
larger ones executed directly. Yet again, this is an outsider's
impression, not the architect's statement.

Regards,
Andrey

Zhu Han

2009-Dec-07 12:42 UTC

[zfs-discuss] Transaction consistency of ZFS

Neil,

Thank you. You closed my question. :-)

best regards,
hanzhu



Zhu Han

2009-Dec-07 12:43 UTC

[zfs-discuss] Fwd: [zfs-code] Transaction consistency of ZFS

Answer from another guru...

Assuming you are using synchronous write semantics, the system call to do a
write will NEVER return UNTIL the data has been written to stable media
(which, in the case of ZFS, might be an SSD-based ZIL, and not the actual
backing hard disks). That is the whole point of synchronous write.

If, however, you are doing async writes, or are never closing the filehandle
(essentially doing a streaming write, which, it sounds like you are doing),
you have no guarantee that it will make it to stable storage at any given
instant (fsync() is required to guarantee a commit; fflush() alone only hands
buffered data to the kernel). For your
type of write, however, where you are constantly appending to the same file
handle, you can count on previous writes committing before subsequent ones -
that is, IF B has made it to stable storage, THEN A will also be there.
However, there is no guarantee that A makes it; it's just that B never makes
it without A having done so already.

I'm not 100% sure, but if you have uncommitted writes A (at t1), B (at t2)
both against the same file, and C (at t2) against a different file, there is
no guarantee that A commits before C. Just that A will commit
before or simultaneously with B.

Don't count on there being a single transaction group for a single file -
if there are say 5 data writes pending on your file, you may see 1-3 committed
at once, while 4-5 wait (they might be committed together, or separately).

-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
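
Synchronous write semantics, as opposed to calling fsync() after each write, can be requested when the file is opened. A minimal sketch, assuming a POSIX system where `os.O_SYNC` is available; the helper name `open_sync_append` is hypothetical.

```python
import os
import tempfile


def open_sync_append(path):
    """Open path for appending so that each write() returns only once flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    return os.open(path, flags, 0o644)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sync.log")
    fd = open_sync_append(path)
    try:
        os.write(fd, b"A")      # returns after A is reported on stable storage
        os.write(fd, b"B")      # so B can never be durable before A
    finally:
        os.close(fd)
```

With this flag every write pays the full flush latency, so it suits workloads where each record must be durable on return rather than the streaming case discussed above.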

Damon Atkins

2009-Dec-08 01:34 UTC

[zfs-discuss] Transaction consistency of ZFS

Because ZFS is transactional (effectively preserves order), the rename trick will
work.
If you find the ".filename", delete it, create a new ".filename",
and when finished writing rename it to "filename". If
"filename" exists you know all writes were completed. If you have a
batch system which looks for the file, it will not find it until it is renamed.
Not that I am a fan of batch systems which use CPU to poll for a file's existence.
