All, I've been following the thread titled 'ZFS: unreliable for professional use' and I've learned a few things. Put simply, external devices don't behave like internal ones.

From JB:

> The good news is that ZFS is getting popular enough on consumer-grade
> hardware. The bad news is that said hardware has a different set of
> failure modes, so it takes a bit of work to become resilient to them.
> This is pretty high on my short list.

From PS:

> I had a cheap-o USB enclosure that definitely did ignore such
> commands. On every txg commit I'd get a warning in dmesg (this was on
> FreeBSD) about the device not implementing the relevant SCSI command.

I use 3 external devices on 2 models of external enclosure (eSATA and USB, consumer grade) -- how can I test this write barrier issue on these 2? Is it worthwhile adding a table to a wiki somewhere recording what has or has not been tested?

Given that ZFS is planned to be used in Snow Leopard, is it worth setting something up for consumer-grade appliance vendors to 'certify' against? ("OK, you play nice with ZFS by doing the right things", etc.) Maybe you can give them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point to consumers and Sun some free marketing?

Thoughts?

Thanks,
Bryant
> I use 3 external devices on 2 models of external enclosure (eSATA and USB,
> consumer grade) -- how can I test this write barrier issue on these 2? Is it
> worthwhile adding to a wiki (table) somewhere what has or has not been tested?

It depends on circumstances. If write barriers are enforced by instructing the device to flush caches, and assuming there is no battery-backed cache, a good way is to make sure that the latency of an fsync() is in fact what it is expected to be.

A test I did was to write a minimalistic program that simply appended one block (8k in this case), fsync():ing in between, timing each fsync(). In my case I was able to detect three distinct modes:

* Write-back caching on the RAID controller (lowest latency).
* Write-through on the RAID controller but write-back on the drives (medium latency).
* Write-through on the RAID controller and the drive (highest latency, as expected given the rotational and seek delay of the drives).

This was useful to test that things "seemed" to behave properly. Of course you only test that it is not systematically mis-behaving, not that it will actually behave correctly under all circumstances.

However, this test boils down to testing durable persistence. If you want to specifically test write barriers regardless of durable persistence, you can write a tool that performs I/Os in a way where you can determine, after the fact, whether they happened in order. For example you could write an ever-increasing sequence of values to deterministic but pseudo-random pages in some larger file, such that you can, after a powerfail test, read them back in and test the sequence of numbers (after sorting it) for the existence of holes.

> Given that ZFS is planned to be used in Snow Leopard, is it worth setting
> something up for consumer grade appliance vendors to 'certify' against? ("Ok,
> you play nice with ZFS by doing the right things", etc.) Maybe you can give
> them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point to
> consumers and Sun some free marketing?

It would actually be nice in general I think, not just for ZFS, to have some standard "run this tool" that will give you a check list of successes/failures that specifically target storage correctness. Though correctness cannot be proven, you can at least test for common cases of systematic incorrect behavior.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
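For concreteness, here is a minimal sketch of the fsync()-timing test described above (my own reconstruction, not Peter's actual program; the file name, 8 KiB block size and iteration count are arbitrary choices). Suspiciously low, uniform latencies suggest something in the path is acknowledging the flush from cache; latencies on the order of a rotation plus a seek suggest the data is really reaching the platters.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
        char buf[8192];                 /* one 8 KiB block per iteration */
        struct timeval t0, t1;
        int i, fd;

        memset(buf, 'x', sizeof(buf));
        fd = open("syncbench.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (i = 0; i < 100; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write");
                        return (1);
                }
                gettimeofday(&t0, NULL);
                if (fsync(fd) != 0) {   /* time just the flush */
                        perror("fsync");
                        return (1);
                }
                gettimeofday(&t1, NULL);
                printf("fsync %3d: %8.0f us\n", i,
                    (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_usec - t0.tv_usec));
        }
        close(fd);
        return (0);
}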
>>>>> "ps" == Peter Schuller <peter.schuller at infidyne.com> writes:ps> A test I did was to write a minimalistic program that simply ps> appended one block (8k in this case), fsync():ing in between, ps> timing each fsync(). were you the one that suggested writing backwards to make the difference bigger? I guess you found that trick unnecessary---speeds differed enough when writing forwards? ps> * Write-back caching on the RAID controller (lowest latency). Did you find a good way to disable this case so you could distinguish between the second two? like, I thought there was some type of SYNCHRONIZE CACHE with a certain flag-bit set, which demands a flush to disk not to NVRAM, and that years ago ZFS was mistakenly sending this overly aggressive command instead of the normal ``just make it persistent'''' sync, so there was that stale best-practice advice to lobotomize the array by ordering it to treat the two commands equivalent. Maybe it would be possible to send that old SYNC command on purpose. Then you could start the tool by comparing speeds with to-disk-SYNC and normal-nvramallowed-SYNC: if they''re the same speed and oddly fast, then you know the array controller is lobotomized, and the second half of the test is thus invalid. If they''re different speeds, then you can trust the second half is actually testing the disks, so lnog as you send old-SYNC. If they''re the same speed but slow, then you don''t have NVRAM. ps> you could write an ever increasing sequence of values to ps> deterministic but pseudo-random pages in some larger file, ps> such that you can, after a powerfail test, read them back in ps> and test the sequence of numbers (after sorting it) for the ps> existence of holes. yeah, the perl script I linked to requires a ``server'''' which is not rebooted and a ``client'''' which is rebooted during the test, and the client checks in its behavior with the server. I think the server should be unnecessary---the script should just know itself, know in the check phase what it would have written. I guess the original script author is thinking more of the SYNC comand and less of the write barrier, but in terms of losing pools or _corrupting_ databases, it''s really only barriers that matter, and SYNC matters only because it''s also an implicit barrier, doesn''t matter exactly when it returns. so....I guess you would need the listening-server to test SYNC is not returning early, like if you want to detect that someone has disabled the ZIL, or if you have an n-tier database system with retries at higher tiers or a system that''s distributed or doing replication, then you do care when SYNC returns and need the not-rebooted listening-server. But you should be able to make a serverless tool just to check write barriers and thus corruption-proofness. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/3bb4b20f/attachment-0013.bin>
On 10 Feb 2009, at 18:35, Bryant Eadon wrote:

> Given that ZFS is planned to be used in Snow Leopard, is it worth
> setting something up for consumer grade appliance vendors to
> 'certify' against? ("Ok, you play nice with ZFS by doing the right
> things", etc.) Maybe you can give them a 'Gold Star' == 'Supports
> ZFS'. That'll give them a selling point to consumers and Sun some
> free marketing?

Curiously though, Apple's only mentioning ZFS in the context of Snow Leopard *Server*, so that's probably enterprise-type disks again.

Cheers,

Chris
On Tue, Feb 10, 2009 at 1:27 PM, Chris Ridd <chrisridd at mac.com> wrote:

> On 10 Feb 2009, at 18:35, Bryant Eadon wrote:
>
>> Given that ZFS is planned to be used in Snow Leopard, is it worth setting
>> something up for consumer grade appliance vendors to 'certify' against? ("Ok,
>> you play nice with ZFS by doing the right things", etc.) Maybe you can give
>> them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point
>> to consumers and Sun some free marketing?
>
> Curiously though, Apple's only mentioning ZFS in the context of Snow
> Leopard *Server*, so that's probably enterprise-type disks again.
>
> Cheers,
>
> Chris

You apparently have not used Apple's disks. They're nothing remotely resembling "enterprise-type" disks.

--Tim
> ps> A test I did was to write a minimalistic program that simply
> ps> appended one block (8k in this case), fsync():ing in between,
> ps> timing each fsync().
>
> were you the one that suggested writing backwards to make the
> difference bigger? I guess you found that trick unnecessary---speeds
> differed enough when writing forwards?

No, that must have been someone else. In this case I did a sequential test exactly because any trivial optimizations done by caching drives or RAID controllers should trivially be able to optimize this particular use case of sequential writing. In other words, I wanted to maximize the chance of hitting the optimization in case caching was in fact not disabled.

> ps> * Write-back caching on the RAID controller (lowest latency).
>
> Did you find a good way to disable this case so you could distinguish
> between the second two?

Yes. I disabled things specifically and got the expected results latency-wise. In particular, with the RAID controller cache disabled and the drive caches not explicitly disabled, I got latencies indicating that the drives did caching (too slow to be the RAID controller, too fast to be on physical disk). This I then confirmed to be the case even according to the administrative tool.

> like, I thought there was some type of SYNCHRONIZE CACHE with a
> certain flag-bit set, which demands a flush to disk, not to NVRAM, and
> that years ago ZFS was mistakenly sending this overly aggressive
> command instead of the normal ``just make it persistent'' sync, so
> there was that stale best-practice advice to lobotomize the array by
> ordering it to treat the two commands as equivalent.

This is something I'm interested in, since my perception so far has been that there is only one. Some driver writer has the opinion that "flush cache" means to flush the cache, while the file system writer uses "flush cache" to mean "I want a write barrier here, or perhaps even durable persistence, but I have no way to express that, so I'm going to ask for a cache flush, which I assume a battery-backed RAID controller will honor from battery-backed cache rather than by actually flushing the drives". Hence the impedance mismatch and a whole bunch of problems.

Is it the case that SCSI defines different "levels" of "forcefulness" of flushing? If so, I'd love to hear the specifics so I can then raise the question with the relevant operating systems as to why there is no distinction between these cases at the block device level in the kernel(s). Could you be referring to FUA/Force Unit Access perhaps, rather than a second type of cache flush?

> Maybe it would be possible to send that old SYNC command on purpose.
> Then you could start the tool by comparing speeds with to-disk-SYNC
> and normal-nvram-allowed-SYNC: if they're the same speed and oddly
> fast, then you know the array controller is lobotomized, and the
> second half of the test is thus invalid. If they're different speeds,
> then you can trust the second half is actually testing the disks, so
> long as you send old-SYNC. If they're the same speed but slow, then
> you don't have NVRAM.

True, though the absolute speeds should tell you quite a lot even without the comparison.

> ps> you could write an ever increasing sequence of values to
> ps> deterministic but pseudo-random pages in some larger file,
> ps> such that you can, after a powerfail test, read them back in
> ps> and test the sequence of numbers (after sorting it) for the
> ps> existence of holes.
>
> yeah, the perl script I linked to requires a ``server'' which is not
> rebooted and a ``client'' which is rebooted during the test, and the
> client checks its behavior in with the server. I think the server
> should be unnecessary---the script should just know itself, in the
> check phase, what it would have written. I guess the original
> script author is thinking more of the SYNC command and less of the
> write barrier, but in terms of losing pools or _corrupting_ databases,
> it's really only barriers that matter, and SYNC matters only because
> it's also an implicit barrier; it doesn't matter exactly when it returns.

Correct. You need the external server to test durability, assuming you are not satisfied with timing-based tests. And as you point out, the write barrier test is fundamentally different.

> so....I guess you would need the listening-server to test that SYNC is not
> returning early, like if you want to detect that someone has disabled
> the ZIL, or if you have an n-tier database system with retries at
> higher tiers or a system that's distributed or doing replication---then
> you do care when SYNC returns and need the not-rebooted
> listening-server. But you should be able to make a serverless tool
> just to check write barriers and thus corruption-proofness.

Agreed.

Btw, a great example of a "non-enterprisy" case where you do care about persistence is the pretty common case of simply running a mail server. Just for anyone reading the above paragraph and concluding it doesn't matter to mere mortals ;)

-- 
/ Peter Schuller
David Collier-Brown
2009-Feb-10 21:56 UTC
[zfs-discuss] Does your device honor write barriers?
Peter Schuller wrote:

> It would actually be nice in general I think, not just for ZFS, to
> have some standard "run this tool" that will give you a check list of
> successes/failures that specifically target storage
> correctness. Though correctness cannot be proven, you can at least
> test for common cases of systematic incorrect behavior.

A tiny niggle: for an operation set of moderate size, you can generate an exhaustive set of tests. I've done so for APIs, but unless you have infinite spare time, you want to generate the test set with a tool (;-))

--dave (who hasn't even Copious Spare Time, much less Infinite) c-b

-- 
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                |                     -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
>>>>> "ps" == Peter Schuller <peter.schuller at infidyne.com> writes:ps> This is something I''m interested in, since my preception so ps> far has been that there is only one. Some driver writer has ps> the opinion that "flush cache" means to flush the cache, while ps> the file system writer uses "flush cache" to mean "I want a ps> write barrier here, or even perhaps durable persistence, but I ps> have no way to express that so I''m going to ask for a cache ps> flush request which I assume a battery backed RAID controller ps> will honor by battery-backed cache rather than actually ps> flushing drives". well....if you want a write barrier, you can issue a flush-cache and wait for a reply before releasing writes behind the barrier. You will get what you want by doing this for certain. so a flush-cache is more forceful than a barrier, as long as you wait for the reply. If you have a barrier command, though, you could insert it into the command stream and NOT wait for the reply, confident nothing would be reordered across it. This way you can preserve ordering without draining the write pipe. I guess if you mistook a cache-flush for a barrier, and just threw it in there thinking ``it''ll act a s a barrier---I don''t have to wait for a reply'''', that could mess things up if someone else in the storage stack doesn''t agree that flushes imply barriers. Here''s a pathological case which may be disconnected from reality in a few spots but is interesting. The OS thinks: * SYNC implies a write barrier. No WRITE issued after the SYNC will be performed until all WRITE issued before the SYNC are done. Also, all WRITE issued before the SYNC will be persistent, once the SYNC has returned. This is a SYNC that includes the idea of a write barrier. You can see the idea has two pieces. The drive thinks: * To avoid tricky problems, let us use the cargo-cult behavior of always acknowledge commands in the same order we receive them. Of course even if it''s not necessary to do this, there''s no reason to DISallow it. * SYNC should not return until all the writes issued before the SYNC are on disk. WRITE''s issued after the SYNC do not need to be on disk before returning, but they can be, because otherwise why would the host have sent them? It makes no sense. After all the goal is to get as much onto the disk as possible, isn''t it? It might be Critical Business Data, so we should write it fast. This SYNC does not include an implicit barrier. It matches what userland programmers expect from fsync(), because they really have no choice---there is not a tagged syscall queue! :) Anyway, the fsync() interpretation is not the only possible interpretation of what SYNC could mean, but it seems to be the one closest to what our drive follows. initiator says disk says disk does t 1: WRITE A ---> | 2: WRITE B ---> writes A | 3: <--- WRITE A is done | 4: SYNC ---> v 5: WRITE C ---> writes C 6: WRITE D ---> 7: WRITE E ---> writes B 8: <--- WRITE B is done 9: <--- SYNC is also done 10: <--- and WRITE C is done! 11: WRITE F ---> writes E 12: <--- WRITE E is done In this case the disk is not ``ignoring'''' the SYNC command. The disk obeys its version of the rules, but ''C'' is suprise-written before the initiator expects. 
If the initiator knew of the disk's rule interpretation, it would implement the write barrier this way and not be surprised:

       initiator says          disk says                disk does
t   1: WRITE A --->
|   2: WRITE B --->                                     writes A
|   3:                <--- WRITE A is done
|   4: SYNC --->
v   5: nothing
    6: nothing
    7: nothing                                          writes B
    8:                <--- WRITE B is done
    9:                <--- SYNC is also done
   10: WRITE C --->
   11: WRITE D --->                                     writes C
   12:                <--- WRITE C is done

of course this is slower, maybe MUCH slower if there is a long communication delay.

The two kinds of synchronize-cache I was talking about were one bit-setting which writes to NVRAM, another which demands a write to disk even when there is NVRAM. I am not sure why the second kind of flush exists at all---probably standards-committee creep. It is not really any of the filesystem's business. but for making a single easy-to-use tool where you don't have or don't trust NVRAM knobs inside the RAID admin tool, the two kinds of sync command could be useful!

A barrier command is hypothetical. I don't know if it exists, and it would be a third kind of command that I don't know if it's possible at all to issue from userland---it was probably considered ``none of userland's business.'' or maybe the spec says it's implied by SYNC like the first initiator thinks---if so, I hope no iSCSI or FC stacks are confused like that disk was.

ps> Is it the case that SCSI defines different "levels" of
ps> "forcefulness" of flushing?

I think it is true there are levels of forcefulness, based on the old sometimes-you-must-disable-cache-flushes-if-you-have-NVRAM ZFS advice. But I don't think there is ever a case where the OS has business asking for the more forceful kind of NVRAM-disallowed flush. The barriers stuff is separate from that.

ps> Btw, a great example of a "non-enterprisy" case where you do
ps> care about persistence [instead of just barriers], is the
ps> pretty common case of simply running a mail server.

yeah. in that case, you have to send a ``message accepted, your message's ID in my queue is ASDFGHJ123'' to the sending MTA. Until the receiver sends this message, the sending MTA is still obligated to resend, and the receiver is allowed to harmlessly lose the message. so it's sort of like NFSv3 batched commits or a replicated database, where the ``when'' matters across two systems, not just the ordering within one system. But for the ``lost my whole ZFS pool'' case it's only barriers that matter.

I think barriers get tangled up with the durability/persistence stuff because a cheap way for a disk driver to implement a barrier is to send the persistence command, then delay all writes after the barrier until the persistence command returns. I'm not sure this is the only way to make a barrier, though---I don't know SCSI well enough.

There is another thing we could worry about---maybe this other disk-level barrier command I do not know about does exist, for drives that have NCQ or TCQ, or for other parts of the stack like FC or the iSCSI initiator-to-target interface or AVS.
It might mean ``do not reorder writes across this barrier, but I don't particularly need a reply.'' It should in theory be faster to use this command where possible, because you don't have to drain the device's work queue as you do while waiting for a reply to SYNCHRONIZE CACHE---if the ordering of the queue can be pushed all the way down to the inside of the hard drive, the latency of restarting writes after the barrier can be much less than draining the entire pipe's write stream, including FC or iSCSI as well. so there is significant incentive, especially on modern high throughput*latency storage, to use a barrier command instead of plain SYNCHRONIZE CACHE whenever possible. But what if some part of the stack ignores these hypothetical barriers, but *does* respect the simple SYNCHRONIZE CACHE persistence command? This first round of fsync()-based tools wouldn't catch it!

Here is another bit of FUD to worry about: the common advice for the lost SAN pools is, use multi-vdev pools. Well, that creepily matches just the scenario I described: if you need to make a write barrier that's valid across devices, the only way to do it is with the SYNCHRONIZE CACHE persistence command, because you need a reply from Device 1 before you can release writes behind the barrier to Device 2. You cannot perform that optimisation I described in the last paragraph of pushing the barrier past the high-latency link down into the device, because your initiator is the only thing these two devices have in common. Keeping the two disks in sync would in effect force the initiator to interpret the SYNC command as in my second example. However if you have just one device, you could write the filesystem to use this hypothetical barrier command instead of the persistence command for higher performance, maybe significantly higher on a high-latency SAN. I don't guess that's actually what's going on though, just an interesting creepy speculation.
> well....if you want a write barrier, you can issue a flush-cache and
> wait for a reply before releasing writes behind the barrier. You will
> get what you want by doing this for certain. so a flush-cache is more
> forceful than a barrier, as long as you wait for the reply.

Yes, this is another peeve of mine, since in many cases it is just so wasteful. Running an ACID-compliant database on ZFS on non-battery-backed storage is one example. (I started a brief conversation about fbarrier() on this list a while back. I really wish something like that would be adopted by some major OSes, so that applications, not just kernel code, can make the distinction.)

> If you have a barrier command, though, you could insert it into the
> command stream and NOT wait for the reply, confident nothing would be
> reordered across it. This way you can preserve ordering without
> draining the write pipe.

Also known as nirvana :)

> Here's a pathological case which may be disconnected from reality in
> a few spots but is interesting.
>
> The OS thinks:
>
>  * SYNC implies a write barrier. No WRITE issued after the SYNC will
>    be performed until all WRITEs issued before the SYNC are done.
>    Also, all WRITEs issued before the SYNC will be persistent once
>    the SYNC has returned.
>
>    This is a SYNC that includes the idea of a write barrier. You can
>    see the idea has two pieces.

Yes. The complaint in my practical situation was that the driver had to be tweaked to not forward syncs in order to get decent performance, but not ignoring them meant an *actual* cache flush regardless of battery-backed cache. Normally correctness was achieved, but expensively, because not only were write barriers enforced by way of syncs, the syncs were literally interpreted as 'flush the cache' syncs even if the controller had battery-backed cache with the appropriate settings to allow it to cache.

>  * SYNC should not return until all the writes issued before the SYNC
>    are on disk. WRITEs issued after the SYNC do not need to be on
>    disk before returning, but they can be, because otherwise why
>    would the host have sent them? It makes no sense. After all, the
>    goal is to get as much onto the disk as possible, isn't it? It
>    might be Critical Business Data, so we should write it fast.
>
>    This SYNC does not include an implicit barrier. It matches what
>    userland programmers expect from fsync(), because they really have
>    no choice---there is not a tagged syscall queue! :)

Well, the SYNC did not include the barrier, but the context in which you use an fsync() to enforce a barrier is one where the application actually does wait for it to return before issuing dependent I/O. Have you seen this particular mode of operation be a problem in practice? As far as I can tell, any assumption on the part of an application that calling fsync(), rather than fsync() actually returning, implies a write barrier would be severely broken and likely to break pretty quickly in practice on most setups.

[snip example]

> In this case the disk is not ``ignoring'' the SYNC command. The disk
> obeys its version of the rules, but 'C' is surprise-written before the
> initiator expects.

Note that even if the disk/controller didn't do this, the operating system's buffer cache is highly likely to introduce similar behavior internally. So unless you are using direct I/O, if you make this assumption about fsync() you're going to be toast even before the drive or storage controller becomes involved, in many practical setups.
[snip correct case example]

> of course this is slower, maybe MUCH slower if there is a long
> communication delay.

It's pretty interesting that the only commonly available method of introducing a write barrier is to use fsync(), which is a more demanding operation. At the same time, fsync() as actually implemented is very rarely useful to begin with, *except* in the context of a write barrier. That is, whenever you actually *do* want fsync() for persistence purposes, you almost always want some kind of write barrier functionality to go with it (in a preceding and/or subsequent operation). Normally, simply committing a bunch of data to disk is not interesting unless you can have certain guarantees with respect to the consistency of that data.

So the common case of needing a write barrier is hindered by the only available call being a much more demanding operation, while the actual more demanding operation is not even useful that often in the absence of the previously mentioned, less demanding barrier operation. Doesn't feel that efficient that the entire world is relying on fsync(), does it...

> The two kinds of synchronize-cache I was talking about were one
> bit-setting which writes to NVRAM, another which demands a write to disk
> even when there is NVRAM.

That was my understanding, but I had never previously gotten the impression that there was such a distinction. At least not at the typical OS/block device layer - I am very weak on SCSI. For example, most recently I considered this in the case of FreeBSD, where there is BIO_FLUSH, but I'm not aware of any distinction such as the above. Is it the case that SCSI has this, but that most OSes simply don't use the more forceful version?

> I am not sure why the second kind of flush
> exists at all---probably standards-committee creep. It is not really
> any of the filesystem's business. but for making a single easy-to-use
> tool where you don't have or don't trust NVRAM knobs inside the RAID
> admin tool, the two kinds of sync command could be useful!

This is exactly my conclusion as well. I can see "really REALLY flush the cache" being useful as an administratively initiated command, decided upon by a human - similarly to issuing a 'sync' command to globally sync all buffers no matter what. But for filesystems/databases/other applications it truly should be completely irrelevant.

> A barrier command is hypothetical. I don't know if it exists, and
> it would be a third kind of command that I don't know if it's possible at
> all to issue from userland---it was probably considered ``none of
> userland's business.'' or maybe the spec says it's implied by SYNC
> like the first initiator thinks---if so, I hope no iSCSI or FC stacks
> are confused like that disk was.

If it was considered none of userland's business, I wholeheartedly disagree ;)

The conclusion from the previous discussion where I brought up fbarrier() seems to be that effectively you have an implicit fbarrier() in between each write() with ZFS. Imagine how nice it would be if the fbarrier() interface had been available, even if mapped to fsync() in most cases. (Mapping fbarrier() -> fsync() would not be a problem as long as fbarrier() is allowed to block.)

> I think it is true there are levels of forcefulness, based on the old
> sometimes-you-must-disable-cache-flushes-if-you-have-NVRAM ZFS advice.

I become paranoid by such advice. What's stopping a RAID device from, for example, ACKing an I/O before it is even in the cache?
I have not designed RAID controller firmware, so I am not sure how likely that is, but I don't see it as an impossibility. Disabling flushing because you have battery-backed NVRAM implies that your battery-backed NVRAM guarantees ordering of all writes, and that nothing is ever placed in said battery-backed cache out of order. Can this assumption really be made safely under all circumstances without intimate knowledge of the controller? I would expect not.

[snip]

> There is another thing we could worry about---maybe this other
> disk-level barrier command I do not know about does exist, for drives
> that have NCQ or TCQ, or for other parts of the stack like FC or the
> iSCSI initiator-to-target interface or AVS. It might mean ``do not
> reorder writes across this barrier, but I don't particularly need a
> reply.''

I have been under the very vague-but-not-well-supported understanding that there is some kind of barrier support going on with SCSI. But it has never been such an issue for me; I have been more concerned with enabling applications to have such a thing propagated down the operating system stack at all to begin with. Before it becomes relevant to me to start worrying about barriers at the SCSI level and whether they are implemented efficiently by certain drives or controllers, I have to see that propagation working to begin with. And as long as all the userland stuff does is an fsync(), we're not there yet.

The exception again is the in-kernel stuff, which stands a better chance. On FreeBSD, last time I read ML posts/code about this, it's just a BIO_FLUSH, and AFAIK there is no distinction made, so ZFS is not able to communicate a barrier-as-opposed-to-sync.

> It should in theory be faster to use this command where
> possible, because you don't have to drain the device's work queue as
> you do while waiting for a reply to SYNCHRONIZE CACHE---if the
> ordering of the queue can be pushed all the way down to the inside of
> the hard drive, the latency of restarting writes after the barrier can
> be much less than draining the entire pipe's write stream, including FC
> or iSCSI as well. so there is significant incentive, especially on
> modern high throughput*latency storage, to use a barrier command
> instead of plain SYNCHRONIZE CACHE whenever possible.

Now further imagine identifying and tagging distinct streams of I/O, such that your fsync() (where you want durability) of a handful of pages of data need not wait for those 50 MB of crap some other process wrote when copying some file. ;) First things first...

> But what if some
> part of the stack ignores these hypothetical barriers, but *does*
> respect the simple SYNCHRONIZE CACHE persistence command? This first
> round of fsync()-based tools wouldn't catch it!

On the other hand, as a practical matter you can always choose to err on the side of caution and have barriers imply sync+wait still. If one is worried about these issues, and if the practical situation is such that you cannot trust the hardware/software involved, I suppose there is no way out other than testing and adjusting until it works.

> Here is another bit of FUD to worry about: the common advice for the
> lost SAN pools is, use multi-vdev pools. Well, that creepily matches
> just the scenario I described: if you need to make a write barrier
> that's valid across devices, the only way to do it is with the
> SYNCHRONIZE CACHE persistence command, because you need a reply from
> Device 1 before you can release writes behind the barrier to Device 2.
> You cannot perform that optimisation I described in the last paragraph
> of pushing the barrier past the high-latency link down into the
> device, because your initiator is the only thing these two devices
> have in common. Keeping the two disks in sync would in effect force
> the initiator to interpret the SYNC command as in my second example.
> However if you have just one device, you could write the filesystem to
> use this hypothetical barrier command instead of the persistence
> command for higher performance, maybe significantly higher on a
> high-latency SAN. I don't guess that's actually what's going on
> though, just an interesting creepy speculation.

This would be another case where battery-backed (local to the machine) NVRAM fundamentally helps, even in a situation where you are only concerned with the barrier, since there is no problem having a battery-backed controller sort out the disk-local problems itself by whatever combination of syncs/barriers, while giving instant barrier support (by effectively implementing sync-and-wait) to the operating system. (Referring now to individual drives being battery-backed, not to using a hardware RAID volume.)

-- 
/ Peter Schuller
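To make the fbarrier() idea discussed above concrete: the sketch below is purely hypothetical -- fbarrier() is not an existing system call on any OS mentioned in this thread. A conservative userland shim could simply map it to fsync(), which over-delivers (durability as well as ordering) but preserves the ordering the caller actually wants; applications written against such an interface could transparently benefit if a real, cheaper barrier primitive ever appeared underneath it.

/*
 * Hypothetical fbarrier() -- NOT a real system call; illustration only.
 * Conservative fallback: ordering via a full flush-and-wait.
 */
#include <unistd.h>

static int
fbarrier(int fd)
{
        return (fsync(fd));
}

/*
 * Typical use: the data block must not be reordered after the commit
 * record, but only the final commit needs to be durable.
 *
 *      write(fd, data, datalen);
 *      fbarrier(fd);           -- ordering only
 *      write(fd, commit, commitlen);
 *      fsync(fd);              -- now durability too
 */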
Bob Friesenhahn
2009-Feb-11 00:08 UTC
[zfs-discuss] Does your device honor write barriers?
On Tue, 10 Feb 2009, Tim wrote:

> You apparently have not used Apple's disks. They're nothing remotely
> resembling "enterprise-type" disks.

That is not true of Apple's only server system, the "Xserve". It uses SAS disks similar to the ones in the enterprise offerings of Sun, IBM, etc., and at a similar price point. The main reason to want to pay $1,200 for Apple's OS X Server offering is so that you can run it on this 1U Xserve hardware. I am not sure why anyone would want to do that, but I have worked for an outfit for which this was the most important priority.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> well....if you want a write barrier, you can issue a flush-cache and
> wait for a reply before releasing writes behind the barrier. You will
> get what you want by doing this for certain.

Not if the disk drive just *ignores* barrier and flush-cache commands and returns success. Some consumer drives really do exactly that. That's the issue that people are asking ZFS to work around.

But it's important to understand that this failure mode (silently ignoring SCSI commands) is truly a case of broken-by-design hardware. If a disk doesn't honor these commands, then no synchronous operation is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC. This means you can't use such disks for (say) a database or NFS server, because it is *impossible* to know when the data is on stable storage.

If it were possible to detect such disks, I'd add code to ZFS that would simply refuse to use them. Unfortunately, there is no reliable way to test the functioning of synchronize-cache programmatically.

Jeff
On 10-Feb-09, at 7:41 PM, Jeff Bonwick wrote:

>> well....if you want a write barrier, you can issue a flush-cache and
>> wait for a reply before releasing writes behind the barrier. You
>> will get what you want by doing this for certain.
>
> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.
> That's the issue that people are asking ZFS to work around.
>
> But it's important to understand that this failure mode (silently
> ignoring SCSI commands) is truly a case of broken-by-design hardware.
> If a disk doesn't honor these commands, then no synchronous operation
> is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC.
> This means you can't use such disks for (say) a database or NFS server,
> because it is *impossible* to know when the data is on stable storage.

This applies equally to virtual disks, of course (can we get VirtualBox to NOT ignore flushes by default?)

--Toby

> If it were possible to detect such disks, I'd add code to ZFS that
> would simply refuse to use them. Unfortunately, there is no reliable
> way to test the functioning of synchronize-cache programmatically.
>
> Jeff
>>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes: >>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:jb> Not if the disk drive just *ignores* barrier and flush-cache jb> commands and returns success. Some consumer drives really do jb> exactly that. That''s the issue that people are asking ZFS to jb> work around. Some are asking ZFS to work around the issue, which I think is not crazy: ZFS is already designed around failures clustered together in space, so why not failures clustered together in time as well? But I''m not in their camp, not asking for that workaround. It couldn''t ever deliver the kind if integrity to which the checksum tree aspires. I''m asking for a solution to the overall problem, mostly outing, avoiding, fixing the broken devices and storage stacks. jb> If it were possible to detect such disks, I''d add code to ZFS jb> that would simply refuse to use them. Unfortunately, there is jb> no reliable way to test the functioning of synchonize-cache jb> programmatically. I think the situation''s closer to: there''s no way to test for it upon adding/attaching/replacing a device, so quickly that the user doesn''t realize it''s happening, and with few enough false positives that you don''t mind supporting it when it goes wrong, and don''t mind defending its correctness when it damages vendor relationships. However I think developing a qualification _procedure_ that sysadmins can actually follow, possibly involving cord-yanking, and one that''s decisive enough we can start sharing results instead of saying ``a major vendor'''' and covering our asses all the time, is quite within reach. And I think it''s all but certain to uncover all sorts of problems which are not in devices, too. tt> This applies equally to virtual disks, of course (can we get tt> VirtualBox to NOT ignore flushes by default?) haha but then people would say it performs so much worse than VMWare! :) To be honest I have not absolutely verified this problem. I just hazily remember reading an email here or a bug report about it. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/50ddd61f/attachment-0005.bin>
On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.

ouch.

> If it were possible to detect such disks, I'd add code to ZFS that
> would simply refuse to use them. Unfortunately, there is no reliable
> way to test the functioning of synchronize-cache programmatically.

How about a database of known bad drives? Like the format.dat of old.

-frank
On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:

> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
> <Jeff.Bonwick at sun.com> wrote:
>> Not if the disk drive just *ignores* barrier and flush-cache commands
>> and returns success. Some consumer drives really do exactly that.
>
> ouch.
>
>> If it were possible to detect such disks, I'd add code to ZFS that
>> would simply refuse to use them. Unfortunately, there is no reliable
>> way to test the functioning of synchronize-cache programmatically.
>
> How about a database of known bad drives? Like the format.dat of old.
>
> -frank

The intransigence of disk makers is incredible. Name and shame might work, though.

--Toby
Toby Thain wrote:

> On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:
>> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
>> <Jeff.Bonwick at sun.com> wrote:
>>> Not if the disk drive just *ignores* barrier and flush-cache commands
>>> and returns success. Some consumer drives really do exactly that.
>>
>> ouch.
>>
>>> If it were possible to detect such disks, I'd add code to ZFS that
>>> would simply refuse to use them. Unfortunately, there is no reliable
>>> way to test the functioning of synchronize-cache programmatically.
>>
>> How about a database of known bad drives? Like the format.dat of old.
>
> The intransigence of disk makers is incredible. Name and shame might
> work, though.

I do like the idea of a 'known bad' DB: just a quick reference for people to check, and a place to drop an email to $vendor indicating someone's added $drive to the list based on $test? It's a lot of work to keep updated, though. :-/

JB> because it is *impossible* to know when the data is on stable storage.

Pardon the ignorance of in-depth drive internals for a moment: would it be possible to time a write of X to the drive, time a write of X to the drive again with a sync, power it off immediately after the sync returns (physically? programmatically?), then power it back on to re-read the data that was just written? If it's there, then the sync didn't lie; otherwise the drive failed the test.

Many BIOSes support powering off the machine on shutdown -- could the same command be issued to hose the drive in this scenario, skipping a 'proper' shutdown procedure? Or would the PSU continue supplying it with power long enough for it to finish writing? I suppose it would have to be a sufficiently large write...

Alternatively, timing the tested writes across various sectors of the disk would give you a good baseline of how long writes take. Would forcing a sync immediately after writes to the same locations give you an indication of whether the sync is doing what it is supposed to? If there's a noticeable (admittedly vague) increase in delay, then we assume the sync worked?
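A rough sketch of that manual power-cut test (my own illustration; the file name and marker format are arbitrary): the writer keeps rewriting a marker at a fixed offset and fsync()s it, printing each marker only after fsync() returns, so whatever was printed last must be on stable storage if the sync was honest. Cut power right after a printed line, then run "check" after reboot and compare by hand; an older marker on disk than the last one printed means the flush was lied about.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char buf[512];
        unsigned long n;
        int fd = open("powercut.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }

        if (argc > 1 && strcmp(argv[1], "check") == 0) {
                /* after reboot: show what actually survived */
                memset(buf, 0, sizeof (buf));
                (void) pread(fd, buf, sizeof (buf) - 1, 0);
                printf("on disk after reboot: %s\n", buf);
                return (0);
        }

        /* writer: cut power right after a "synced:" line appears */
        for (n = 1; ; n++) {
                (void) snprintf(buf, sizeof (buf), "marker %lu at %ld",
                    n, (long)time(NULL));
                if (pwrite(fd, buf, sizeof (buf), 0) != sizeof (buf)) {
                        perror("pwrite");
                        return (1);
                }
                if (fsync(fd) != 0) {
                        perror("fsync");
                        return (1);
                }
                /* printed only after fsync() has returned */
                printf("synced: %s\n", buf);
                (void) fflush(stdout);
                sleep(1);
        }
        /* NOTREACHED */
}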
On Tue, Feb 10 at 16:41, Jeff Bonwick wrote:

> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.
> That's the issue that people are asking ZFS to work around.

Can someone please name a specific device (vendor + model + firmware revision) that does this? I see this claim thrown around like fact repeatedly, and yet I've never personally experienced an actual "consumer" device that discarded FLUSH CACHE (EXT) before, and nobody I know can name one that did.

The only exceptions that might "appear" to be ignoring a barrier that I've witnessed are "high fly" writes in rotating drives, where the servo system couldn't detect that the head struck a defect and was deflected too high to write, and devices that don't support the command at all (and thus abort 51/04 attempts to flush the cache).

BTW, funky/busted bridge hardware in external USB devices doesn't count. I'm more interested in major rotating drive vendors... Seagate/Maxtor, WD, Hitachi/IBM, Fujitsu, Toshiba, etc.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
David Dyer-Bennet
2009-Feb-11 15:27 UTC
[zfs-discuss] Does your device honor write barriers?
On Wed, February 11, 2009 02:33, Eric D. Mudama wrote:

> BTW, funky/busted bridge hardware in external USB devices doesn't count.

They do for me; I'm currently using external USB drives for my backup datasets (in the process of converting to use zfs send/recv to get the data there). My normal procedure even involves yanking the USB cables (in theory long after the backup is completed and the pool is exported, but if the script fails/hangs I might well yank the cable in the morning without verifying the results of the overnight script first).

> I'm more interested in major rotating drive vendors... Seagate/Maxtor,
> WD, Hitachi/IBM, Fujitsu, Toshiba, etc.

There is a larger set of disaster modes if the problem is at that level, of course.

I'd really like to see a reasonably cook-book disk qualification procedure that could detect these problems. It might have to involve a timed disconnect of some sort, which might require hot-swap hardware, but that's something one could live with. And if the qualification procedure were widely believed to be good, aggregating results would be useful.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On February 10, 2009 11:53:39 PM -0500 Toby Thain <toby at telegraphics.com.au> wrote:

> On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:
>
>> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
>> <Jeff.Bonwick at sun.com> wrote:
>>> Not if the disk drive just *ignores* barrier and flush-cache commands
>>> and returns success. Some consumer drives really do exactly that.
>>
>> ouch.
>>
>>> If it were possible to detect such disks, I'd add code to ZFS that
>>> would simply refuse to use them. Unfortunately, there is no reliable
>>> way to test the functioning of synchronize-cache programmatically.
>>
>> How about a database of known bad drives? Like the format.dat of old.
>
> The intransigence of disk makers is incredible. Name and shame might
> work, though.

I, for one, don't really care about shaming any vendor. I care about not using broken products. The database need not be compiled by Sun, but it should (ideally) be distributed by them (in OpenSolaris) and supported by zfs.

-frank