All, I've been following the thread titled 'ZFS: unreliable for professional use' and I've learned a few things. Put simply, external devices don't behave like internal ones.

From JB:

> The good news is that ZFS is getting popular enough on consumer-grade
> hardware. The bad news is that said hardware has a different set of
> failure modes, so it takes a bit of work to become resilient to them.
> This is pretty high on my short list.

From PS:

> I had a cheap-o USB enclosure that definitely did ignore such
> commands. On every txg commit I'd get a warning in dmesg (this was on
> FreeBSD) about the device not implementing the relevant SCSI command.

I use 3 external devices on 2 models of external enclosure (eSATA and USB, consumer grade) -- how can I test this write barrier issue on these 2? Is it worthwhile adding a table to a wiki somewhere recording what has or has not been tested?

Given that ZFS is planned to be used in Snow Leopard, is it worth setting something up for consumer-grade appliance vendors to 'certify' against? ("OK, you play nice with ZFS by doing the right things", etc.) Maybe you can give them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point to consumers and Sun some free marketing?

Thoughts?

Thanks,
Bryant
> I use 3 external devices on 2 models of external enclosure (eSATA and USB,
> consumer grade) -- how can I test this write barrier issue on these 2? Is it
> worthwhile adding to a wiki (table) somewhere what has or has not been tested?

It depends on circumstances. If write barriers are enforced by instructing the device to flush caches, and assuming there is no battery-backed cache, a good way is to make sure that the latency of an fsync() is in fact what it is expected to be.

A test I did was to write a minimalistic program that simply appended one block (8k in this case), fsync():ing in between, timing each fsync(). In my case I was able to detect three distinct modes:

* Write-back caching on the RAID controller (lowest latency).
* Write-through on the RAID controller but write-back on the drives (medium latency).
* Write-through on the RAID controller and the drive (highest latency, as expected given the rotational and seek delay of the drives).

This was useful to test that things "seemed" to behave properly. Of course you only test that it is not systematically mis-behaving, not that it will actually behave correctly under all circumstances.

However, this test boils down to testing durable persistence. If you want to specifically test write barriers regardless of durable persistence, you can write a tool that performs I/Os in a way where you can determine, after the fact, whether they happened in order. For example you could write an ever-increasing sequence of values to deterministic but pseudo-random pages in some larger file, such that you can, after a powerfail test, read them back in and test the sequence of numbers (after sorting it) for the existence of holes.

> Given that ZFS is planned to be used in Snow Leopard, is it worth setting
> something up for consumer grade appliance vendors to 'certify' against? ("Ok,
> you play nice with ZFS by doing the right things", etc.) Maybe you can give
> them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point to
> consumers and Sun some free marketing?

It would actually be nice in general I think, not just for ZFS, to have some standard "run this tool" that will give you a check list of successes/failures that specifically target storage correctness. Though correctness cannot be proven, you can at least test for common cases of systematic incorrect behavior.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
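For concreteness, here is a minimal sketch of the fsync()-timing test described above (my own reconstruction, not Peter's actual program; the file name, 8 KiB block size and iteration count are arbitrary choices). Suspiciously low, uniform latencies suggest something in the path is acknowledging the flush from cache; latencies on the order of a rotation plus a seek suggest the data is really reaching the platters.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
        char buf[8192];                 /* one 8 KiB block per iteration */
        struct timeval t0, t1;
        int i, fd;

        memset(buf, 'x', sizeof(buf));
        fd = open("syncbench.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (i = 0; i < 100; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write");
                        return (1);
                }
                gettimeofday(&t0, NULL);
                if (fsync(fd) != 0) {   /* time just the flush */
                        perror("fsync");
                        return (1);
                }
                gettimeofday(&t1, NULL);
                printf("fsync %3d: %8.0f us\n", i,
                    (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_usec - t0.tv_usec));
        }
        close(fd);
        return (0);
}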
>>>>> "ps" == Peter Schuller <peter.schuller at infidyne.com> writes:ps> A test I did was to write a minimalistic program that simply ps> appended one block (8k in this case), fsync():ing in between, ps> timing each fsync(). were you the one that suggested writing backwards to make the difference bigger? I guess you found that trick unnecessary---speeds differed enough when writing forwards? ps> * Write-back caching on the RAID controller (lowest latency). Did you find a good way to disable this case so you could distinguish between the second two? like, I thought there was some type of SYNCHRONIZE CACHE with a certain flag-bit set, which demands a flush to disk not to NVRAM, and that years ago ZFS was mistakenly sending this overly aggressive command instead of the normal ``just make it persistent'''' sync, so there was that stale best-practice advice to lobotomize the array by ordering it to treat the two commands equivalent. Maybe it would be possible to send that old SYNC command on purpose. Then you could start the tool by comparing speeds with to-disk-SYNC and normal-nvramallowed-SYNC: if they''re the same speed and oddly fast, then you know the array controller is lobotomized, and the second half of the test is thus invalid. If they''re different speeds, then you can trust the second half is actually testing the disks, so lnog as you send old-SYNC. If they''re the same speed but slow, then you don''t have NVRAM. ps> you could write an ever increasing sequence of values to ps> deterministic but pseudo-random pages in some larger file, ps> such that you can, after a powerfail test, read them back in ps> and test the sequence of numbers (after sorting it) for the ps> existence of holes. yeah, the perl script I linked to requires a ``server'''' which is not rebooted and a ``client'''' which is rebooted during the test, and the client checks in its behavior with the server. I think the server should be unnecessary---the script should just know itself, know in the check phase what it would have written. I guess the original script author is thinking more of the SYNC comand and less of the write barrier, but in terms of losing pools or _corrupting_ databases, it''s really only barriers that matter, and SYNC matters only because it''s also an implicit barrier, doesn''t matter exactly when it returns. so....I guess you would need the listening-server to test SYNC is not returning early, like if you want to detect that someone has disabled the ZIL, or if you have an n-tier database system with retries at higher tiers or a system that''s distributed or doing replication, then you do care when SYNC returns and need the not-rebooted listening-server. But you should be able to make a serverless tool just to check write barriers and thus corruption-proofness. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/3bb4b20f/attachment-0013.bin>
On 10 Feb 2009, at 18:35, Bryant Eadon wrote:

> Given that ZFS is planned to be used in Snow Leopard, is it worth
> setting something up for consumer grade appliance vendors to
> 'certify' against? ("Ok, you play nice with ZFS by doing the right
> things", etc.) Maybe you can give them a 'Gold Star' == 'Supports
> ZFS'. That'll give them a selling point to consumers and Sun some
> free marketing?

Curiously though, Apple's only mentioning ZFS in the context of Snow Leopard *Server*, so that's probably enterprise-type disks again.

Cheers,

Chris
On Tue, Feb 10, 2009 at 1:27 PM, Chris Ridd <chrisridd at mac.com> wrote:

> On 10 Feb 2009, at 18:35, Bryant Eadon wrote:
>
>> Given that ZFS is planned to be used in Snow Leopard, is it worth setting
>> something up for consumer grade appliance vendors to 'certify' against? ("Ok,
>> you play nice with ZFS by doing the right things", etc.) Maybe you can give
>> them a 'Gold Star' == 'Supports ZFS'. That'll give them a selling point
>> to consumers and Sun some free marketing?
>
> Curiously though, Apple's only mentioning ZFS in the context of Snow
> Leopard *Server*, so that's probably enterprise-type disks again.
>
> Cheers,
>
> Chris

You apparently have not used Apple's disks. They're nothing remotely resembling "enterprise-type" disks.

--Tim
> ps> A test I did was to write a minimalistic program that simply
> ps> appended one block (8k in this case), fsync():ing in between,
> ps> timing each fsync().
>
> were you the one that suggested writing backwards to make the
> difference bigger? I guess you found that trick unnecessary---speeds
> differed enough when writing forwards?

No, that must have been someone else. In this case I did a sequential test exactly because any trivial optimizations done by caching drives or RAID controllers should trivially be able to optimize this particular use case of sequential writing. In other words, I wanted to maximize the chance of hitting the optimization in case caching was in fact not disabled.

> ps> * Write-back caching on the RAID controller (lowest latency).
>
> Did you find a good way to disable this case so you could distinguish
> between the second two?

Yes. I disabled things specifically and got the expected results latency-wise. In particular, with the RAID controller cache disabled and the drive caches not explicitly disabled, I got latencies indicating that the drives did caching (too slow to be the RAID controller, too fast to be on physical disk). This I then confirmed to be the case even according to the administrative tool.

> like, I thought there was some type of SYNCHRONIZE CACHE with a
> certain flag-bit set, which demands a flush to disk, not to NVRAM, and
> that years ago ZFS was mistakenly sending this overly aggressive
> command instead of the normal ``just make it persistent'' sync, so
> there was that stale best-practice advice to lobotomize the array by
> ordering it to treat the two commands as equivalent.

This is something I'm interested in, since my perception so far has been that there is only one. Some driver writer has the opinion that "flush cache" means to flush the cache, while the file system writer uses "flush cache" to mean "I want a write barrier here, or perhaps even durable persistence, but I have no way to express that, so I'm going to ask for a cache flush, which I assume a battery-backed RAID controller will honor from battery-backed cache rather than by actually flushing the drives". Hence the impedance mismatch and a whole bunch of problems.

Is it the case that SCSI defines different "levels" of "forcefulness" of flushing? If so, I'd love to hear the specifics so I can then raise the question with the relevant operating systems as to why there is no distinction between these cases at the block device level in the kernel(s). Could you be referring to FUA/Force Unit Access perhaps, rather than a second type of cache flush?

> Maybe it would be possible to send that old SYNC command on purpose.
> Then you could start the tool by comparing speeds with to-disk-SYNC
> and normal-nvram-allowed-SYNC: if they're the same speed and oddly
> fast, then you know the array controller is lobotomized, and the
> second half of the test is thus invalid. If they're different speeds,
> then you can trust the second half is actually testing the disks, so
> long as you send old-SYNC. If they're the same speed but slow, then
> you don't have NVRAM.

True, though the absolute speeds should tell you quite a lot even without the comparison.

> ps> you could write an ever increasing sequence of values to
> ps> deterministic but pseudo-random pages in some larger file,
> ps> such that you can, after a powerfail test, read them back in
> ps> and test the sequence of numbers (after sorting it) for the
> ps> existence of holes.
>
> yeah, the perl script I linked to requires a ``server'' which is not
> rebooted and a ``client'' which is rebooted during the test, and the
> client checks its behavior in with the server. I think the server
> should be unnecessary---the script should just know itself, in the
> check phase, what it would have written. I guess the original
> script author is thinking more of the SYNC command and less of the
> write barrier, but in terms of losing pools or _corrupting_ databases,
> it's really only barriers that matter, and SYNC matters only because
> it's also an implicit barrier; it doesn't matter exactly when it returns.

Correct. You need the external server to test durability, assuming you are not satisfied with timing-based tests. And as you point out, the write barrier test is fundamentally different.

> so....I guess you would need the listening-server to test that SYNC is not
> returning early, like if you want to detect that someone has disabled
> the ZIL, or if you have an n-tier database system with retries at
> higher tiers or a system that's distributed or doing replication---then
> you do care when SYNC returns and need the not-rebooted
> listening-server. But you should be able to make a serverless tool
> just to check write barriers and thus corruption-proofness.

Agreed.

Btw, a great example of a "non-enterprisy" case where you do care about persistence is the pretty common case of simply running a mail server. Just for anyone reading the above paragraph and concluding it doesn't matter to mere mortals ;)

-- 
/ Peter Schuller
David Collier-Brown
2009-Feb-10 21:56 UTC
[zfs-discuss] Does your device honor write barriers?
Peter Schuller wrote:

> It would actually be nice in general I think, not just for ZFS, to
> have some standard "run this tool" that will give you a check list of
> successes/failures that specifically target storage
> correctness. Though correctness cannot be proven, you can at least
> test for common cases of systematic incorrect behavior.

A tiny niggle: for an operation set of moderate size, you can generate an exhaustive set of tests. I've done so for APIs, but unless you have infinite spare time, you want to generate the test set with a tool (;-))

--dave (who hasn't even Copious Spare Time, much less Infinite) c-b

-- 
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                |                     -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
>>>>> "ps" == Peter Schuller <peter.schuller at infidyne.com> writes:ps> This is something I''m interested in, since my preception so ps> far has been that there is only one. Some driver writer has ps> the opinion that "flush cache" means to flush the cache, while ps> the file system writer uses "flush cache" to mean "I want a ps> write barrier here, or even perhaps durable persistence, but I ps> have no way to express that so I''m going to ask for a cache ps> flush request which I assume a battery backed RAID controller ps> will honor by battery-backed cache rather than actually ps> flushing drives". well....if you want a write barrier, you can issue a flush-cache and wait for a reply before releasing writes behind the barrier. You will get what you want by doing this for certain. so a flush-cache is more forceful than a barrier, as long as you wait for the reply. If you have a barrier command, though, you could insert it into the command stream and NOT wait for the reply, confident nothing would be reordered across it. This way you can preserve ordering without draining the write pipe. I guess if you mistook a cache-flush for a barrier, and just threw it in there thinking ``it''ll act a s a barrier---I don''t have to wait for a reply'''', that could mess things up if someone else in the storage stack doesn''t agree that flushes imply barriers. Here''s a pathological case which may be disconnected from reality in a few spots but is interesting. The OS thinks: * SYNC implies a write barrier. No WRITE issued after the SYNC will be performed until all WRITE issued before the SYNC are done. Also, all WRITE issued before the SYNC will be persistent, once the SYNC has returned. This is a SYNC that includes the idea of a write barrier. You can see the idea has two pieces. The drive thinks: * To avoid tricky problems, let us use the cargo-cult behavior of always acknowledge commands in the same order we receive them. Of course even if it''s not necessary to do this, there''s no reason to DISallow it. * SYNC should not return until all the writes issued before the SYNC are on disk. WRITE''s issued after the SYNC do not need to be on disk before returning, but they can be, because otherwise why would the host have sent them? It makes no sense. After all the goal is to get as much onto the disk as possible, isn''t it? It might be Critical Business Data, so we should write it fast. This SYNC does not include an implicit barrier. It matches what userland programmers expect from fsync(), because they really have no choice---there is not a tagged syscall queue! :) Anyway, the fsync() interpretation is not the only possible interpretation of what SYNC could mean, but it seems to be the one closest to what our drive follows. initiator says disk says disk does t 1: WRITE A ---> | 2: WRITE B ---> writes A | 3: <--- WRITE A is done | 4: SYNC ---> v 5: WRITE C ---> writes C 6: WRITE D ---> 7: WRITE E ---> writes B 8: <--- WRITE B is done 9: <--- SYNC is also done 10: <--- and WRITE C is done! 11: WRITE F ---> writes E 12: <--- WRITE E is done In this case the disk is not ``ignoring'''' the SYNC command. The disk obeys its version of the rules, but ''C'' is suprise-written before the initiator expects. 
If the initiator knew of the disk's rule interpretation, it would implement the write barrier this way and not be surprised:

       initiator says          disk says                disk does
t   1: WRITE A --->
|   2: WRITE B --->                                     writes A
|   3:                <--- WRITE A is done
|   4: SYNC --->
v   5: nothing
    6: nothing
    7: nothing                                          writes B
    8:                <--- WRITE B is done
    9:                <--- SYNC is also done
   10: WRITE C --->
   11: WRITE D --->                                     writes C
   12:                <--- WRITE C is done

of course this is slower, maybe MUCH slower if there is a long communication delay.

The two kinds of synchronize-cache I was talking about were one bit-setting which writes to NVRAM, another which demands a write to disk even when there is NVRAM. I am not sure why the second kind of flush exists at all---probably standards-committee creep. It is not really any of the filesystem's business. but for making a single easy-to-use tool where you don't have or don't trust NVRAM knobs inside the RAID admin tool, the two kinds of sync command could be useful!

A barrier command is hypothetical. I don't know if it exists, and it would be a third kind of command that I don't know if it's possible at all to issue from userland---it was probably considered ``none of userland's business.'' or maybe the spec says it's implied by SYNC like the first initiator thinks---if so, I hope no iSCSI or FC stacks are confused like that disk was.

ps> Is it the case that SCSI defines different "levels" of
ps> "forcefulness" of flushing?

I think it is true there are levels of forcefulness, based on the old sometimes-you-must-disable-cache-flushes-if-you-have-NVRAM ZFS advice. But I don't think there is ever a case where the OS has business asking for the more forceful kind of NVRAM-disallowed flush. The barriers stuff is separate from that.

ps> Btw, a great example of a "non-enterprisy" case where you do
ps> care about persistence [instead of just barriers], is the
ps> pretty common case of simply running a mail server.

yeah. in that case, you have to send a ``message accepted, your message's ID in my queue is ASDFGHJ123'' to the sending MTA. Until the receiver sends this message, the sending MTA is still obligated to resend, and the receiver is allowed to harmlessly lose the message. so it's sort of like NFSv3 batched commits or a replicated database, where the ``when'' matters across two systems, not just the ordering within one system. But for the ``lost my whole ZFS pool'' case it's only barriers that matter.

I think barriers get tangled up with the durability/persistence stuff because a cheap way for a disk driver to implement a barrier is to send the persistence command, then delay all writes after the barrier until the persistence command returns. I'm not sure this is the only way to make a barrier, though---I don't know SCSI well enough.

There is another thing we could worry about---maybe this other disk-level barrier command I do not know about does exist, for drives that have NCQ or TCQ, or for other parts of the stack like FC or the iSCSI initiator-to-target interface or AVS.
It might mean ``do not reorder writes across this barrier, but I don't particularly need a reply.'' It should in theory be faster to use this command where possible, because you don't have to drain the device's work queue as you do while waiting for a reply to SYNCHRONIZE CACHE---if the ordering of the queue can be pushed all the way down to the inside of the hard drive, the latency of restarting writes after the barrier can be much less than draining the entire pipe's write stream, including FC or iSCSI as well. so there is significant incentive, especially on modern high throughput*latency storage, to use a barrier command instead of plain SYNCHRONIZE CACHE whenever possible. But what if some part of the stack ignores these hypothetical barriers, but *does* respect the simple SYNCHRONIZE CACHE persistence command? This first round of fsync()-based tools wouldn't catch it!

Here is another bit of FUD to worry about: the common advice for the lost SAN pools is, use multi-vdev pools. Well, that creepily matches just the scenario I described: if you need to make a write barrier that's valid across devices, the only way to do it is with the SYNCHRONIZE CACHE persistence command, because you need a reply from Device 1 before you can release writes behind the barrier to Device 2. You cannot perform that optimisation I described in the last paragraph of pushing the barrier past the high-latency link down into the device, because your initiator is the only thing these two devices have in common. Keeping the two disks in sync would in effect force the initiator to interpret the SYNC command as in my second example. However if you have just one device, you could write the filesystem to use this hypothetical barrier command instead of the persistence command for higher performance, maybe significantly higher on a high-latency SAN. I don't guess that's actually what's going on though, just an interesting creepy speculation.
> well....if you want a write barrier, you can issue a flush-cache and
> wait for a reply before releasing writes behind the barrier. You will
> get what you want by doing this for certain. so a flush-cache is more
> forceful than a barrier, as long as you wait for the reply.

Yes, this is another peeve of mine, since in many cases it is just so wasteful. Running an ACID-compliant database on ZFS on non-battery-backed storage is one example. (I started a brief conversation about fbarrier() on this list a while back. I really wish something like that would be adopted by some major OSes, so that applications, not just kernel code, can make the distinction.)

> If you have a barrier command, though, you could insert it into the
> command stream and NOT wait for the reply, confident nothing would be
> reordered across it. This way you can preserve ordering without
> draining the write pipe.

Also known as nirvana :)

> Here's a pathological case which may be disconnected from reality in
> a few spots but is interesting.
>
> The OS thinks:
>
>  * SYNC implies a write barrier. No WRITE issued after the SYNC will
>    be performed until all WRITEs issued before the SYNC are done.
>    Also, all WRITEs issued before the SYNC will be persistent once
>    the SYNC has returned.
>
>    This is a SYNC that includes the idea of a write barrier. You can
>    see the idea has two pieces.

Yes. The complaint in my practical situation was that the driver had to be tweaked to not forward syncs in order to get decent performance, but not ignoring them meant an *actual* cache flush regardless of battery-backed cache. Normally correctness was achieved, but expensively, because not only were write barriers enforced by way of syncs, the syncs were literally interpreted as 'flush the cache' syncs even if the controller had battery-backed cache with the appropriate settings to allow it to cache.

>  * SYNC should not return until all the writes issued before the SYNC
>    are on disk. WRITEs issued after the SYNC do not need to be on
>    disk before returning, but they can be, because otherwise why
>    would the host have sent them? It makes no sense. After all, the
>    goal is to get as much onto the disk as possible, isn't it? It
>    might be Critical Business Data, so we should write it fast.
>
>    This SYNC does not include an implicit barrier. It matches what
>    userland programmers expect from fsync(), because they really have
>    no choice---there is not a tagged syscall queue! :)

Well, the SYNC did not include the barrier, but the context in which you use an fsync() to enforce a barrier is one where the application actually does wait for it to return before issuing dependent I/O. Have you seen this particular mode of operation be a problem in practice? As far as I can tell, any assumption on the part of an application that calling fsync(), rather than fsync() actually returning, implies a write barrier would be severely broken and likely to break pretty quickly in practice on most setups.

[snip example]

> In this case the disk is not ``ignoring'' the SYNC command. The disk
> obeys its version of the rules, but 'C' is surprise-written before the
> initiator expects.

Note that even if the disk/controller didn't do this, the operating system's buffer cache is highly likely to introduce similar behavior internally. So unless you are using direct I/O, if you make this assumption about fsync() you're going to be toast even before the drive or storage controller becomes involved, in many practical setups.
[snip correct case example]

> of course this is slower, maybe MUCH slower if there is a long
> communication delay.

It's pretty interesting that the only commonly available method of introducing a write barrier is to use fsync(), which is a more demanding operation. At the same time, fsync() as actually implemented is very rarely useful to begin with, *except* in the context of a write barrier. That is, whenever you actually *do* want fsync() for persistence purposes, you almost always want some kind of write barrier functionality to go with it (in a preceding and/or subsequent operation). Normally, simply committing a bunch of data to disk is not interesting unless you can have certain guarantees with respect to the consistency of that data.

So the common case of needing a write barrier is hindered by the only available call being a much more demanding operation, while the actual more demanding operation is not even useful that often in the absence of the previously mentioned, less demanding barrier operation. Doesn't feel that efficient that the entire world is relying on fsync(), does it...

> The two kinds of synchronize-cache I was talking about were one
> bit-setting which writes to NVRAM, another which demands a write to disk
> even when there is NVRAM.

That was my understanding, but I had never previously gotten the impression that there was such a distinction. At least not at the typical OS/block device layer - I am very weak on SCSI. For example, most recently I considered this in the case of FreeBSD, where there is BIO_FLUSH, but I'm not aware of any distinction such as the above. Is it the case that SCSI has this, but that most OSes simply don't use the more forceful version?

> I am not sure why the second kind of flush
> exists at all---probably standards-committee creep. It is not really
> any of the filesystem's business. but for making a single easy-to-use
> tool where you don't have or don't trust NVRAM knobs inside the RAID
> admin tool, the two kinds of sync command could be useful!

This is exactly my conclusion as well. I can see "really REALLY flush the cache" being useful as an administratively initiated command, decided upon by a human - similarly to issuing a 'sync' command to globally sync all buffers no matter what. But for filesystems/databases/other applications it truly should be completely irrelevant.

> A barrier command is hypothetical. I don't know if it exists, and
> it would be a third kind of command that I don't know if it's possible at
> all to issue from userland---it was probably considered ``none of
> userland's business.'' or maybe the spec says it's implied by SYNC
> like the first initiator thinks---if so, I hope no iSCSI or FC stacks
> are confused like that disk was.

If it was considered none of userland's business, I wholeheartedly disagree ;)

The conclusion from the previous discussion where I brought up fbarrier() seems to be that effectively you have an implicit fbarrier() in between each write() with ZFS. Imagine how nice it would be if the fbarrier() interface had been available, even if mapped to fsync() in most cases. (Mapping fbarrier() -> fsync() would not be a problem as long as fbarrier() is allowed to block.)

> I think it is true there are levels of forcefulness, based on the old
> sometimes-you-must-disable-cache-flushes-if-you-have-NVRAM ZFS advice.

I become paranoid by such advice. What's stopping a RAID device from, for example, ACKing an I/O before it is even in the cache?
I have not designed RAID controller firmware, so I am not sure how likely that is, but I don't see it as an impossibility. Disabling flushing because you have battery-backed NVRAM implies that your battery-backed NVRAM guarantees ordering of all writes, and that nothing is ever placed in said battery-backed cache out of order. Can this assumption really be made safely under all circumstances without intimate knowledge of the controller? I would expect not.

[snip]

> There is another thing we could worry about---maybe this other
> disk-level barrier command I do not know about does exist, for drives
> that have NCQ or TCQ, or for other parts of the stack like FC or the
> iSCSI initiator-to-target interface or AVS. It might mean ``do not
> reorder writes across this barrier, but I don't particularly need a
> reply.''

I have been under the very vague-but-not-well-supported understanding that there is some kind of barrier support going on with SCSI. But it has never been such an issue for me; I have been more concerned with enabling applications to have such a thing propagated down the operating system stack at all to begin with. Before it becomes relevant to me to start worrying about barriers at the SCSI level and whether they are implemented efficiently by certain drives or controllers, I have to see that propagation working to begin with. And as long as all the userland stuff does is an fsync(), we're not there yet.

The exception again is the in-kernel stuff, which stands a better chance. On FreeBSD, last time I read ML posts/code about this, it's just a BIO_FLUSH, and AFAIK there is no distinction made, so ZFS is not able to communicate a barrier-as-opposed-to-sync.

> It should in theory be faster to use this command where
> possible, because you don't have to drain the device's work queue as
> you do while waiting for a reply to SYNCHRONIZE CACHE---if the
> ordering of the queue can be pushed all the way down to the inside of
> the hard drive, the latency of restarting writes after the barrier can
> be much less than draining the entire pipe's write stream, including FC
> or iSCSI as well. so there is significant incentive, especially on
> modern high throughput*latency storage, to use a barrier command
> instead of plain SYNCHRONIZE CACHE whenever possible.

Now further imagine identifying and tagging distinct streams of I/O, such that your fsync() (where you want durability) of a handful of pages of data need not wait for those 50 MB of crap some other process wrote when copying some file. ;) First things first...

> But what if some
> part of the stack ignores these hypothetical barriers, but *does*
> respect the simple SYNCHRONIZE CACHE persistence command? This first
> round of fsync()-based tools wouldn't catch it!

On the other hand, as a practical matter you can always choose to err on the side of caution and have barriers imply sync+wait still. If one is worried about these issues, and if the practical situation is such that you cannot trust the hardware/software involved, I suppose there is no way out other than testing and adjusting until it works.

> Here is another bit of FUD to worry about: the common advice for the
> lost SAN pools is, use multi-vdev pools. Well, that creepily matches
> just the scenario I described: if you need to make a write barrier
> that's valid across devices, the only way to do it is with the
> SYNCHRONIZE CACHE persistence command, because you need a reply from
> Device 1 before you can release writes behind the barrier to Device 2.
> You cannot perform that optimisation I described in the last paragraph
> of pushing the barrier past the high-latency link down into the
> device, because your initiator is the only thing these two devices
> have in common. Keeping the two disks in sync would in effect force
> the initiator to interpret the SYNC command as in my second example.
> However if you have just one device, you could write the filesystem to
> use this hypothetical barrier command instead of the persistence
> command for higher performance, maybe significantly higher on a
> high-latency SAN. I don't guess that's actually what's going on
> though, just an interesting creepy speculation.

This would be another case where battery-backed (local to the machine) NVRAM fundamentally helps, even in a situation where you are only concerned with the barrier, since there is no problem having a battery-backed controller sort out the disk-local problems itself by whatever combination of syncs/barriers, while giving instant barrier support (by effectively implementing sync-and-wait) to the operating system. (Referring now to individual drives being battery-backed, not to using a hardware RAID volume.)

-- 
/ Peter Schuller
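To make the fbarrier() idea discussed above concrete: the sketch below is purely hypothetical -- fbarrier() is not an existing system call on any OS mentioned in this thread. A conservative userland shim could simply map it to fsync(), which over-delivers (durability as well as ordering) but preserves the ordering the caller actually wants; applications written against such an interface could transparently benefit if a real, cheaper barrier primitive ever appeared underneath it.

/*
 * Hypothetical fbarrier() -- NOT a real system call; illustration only.
 * Conservative fallback: ordering via a full flush-and-wait.
 */
#include <unistd.h>

static int
fbarrier(int fd)
{
        return (fsync(fd));
}

/*
 * Typical use: the data block must not be reordered after the commit
 * record, but only the final commit needs to be durable.
 *
 *      write(fd, data, datalen);
 *      fbarrier(fd);           -- ordering only
 *      write(fd, commit, commitlen);
 *      fsync(fd);              -- now durability too
 */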
Bob Friesenhahn
2009-Feb-11 00:08 UTC
[zfs-discuss] Does your device honor write barriers?
On Tue, 10 Feb 2009, Tim wrote:

> You apparently have not used Apple's disks. They're nothing remotely
> resembling "enterprise-type" disks.

That is not true of Apple's only server system, the "Xserve". It uses SAS disks similar to the ones in the enterprise offerings of Sun, IBM, etc., and at a similar price point. The main reason to want to pay $1,200 for Apple's OS X Server offering is so that you can run it on this 1U Xserve hardware. I am not sure why anyone would want to do that, but I have worked for an outfit for which this was the most important priority.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> well....if you want a write barrier, you can issue a flush-cache and
> wait for a reply before releasing writes behind the barrier. You will
> get what you want by doing this for certain.

Not if the disk drive just *ignores* barrier and flush-cache commands and returns success. Some consumer drives really do exactly that. That's the issue that people are asking ZFS to work around.

But it's important to understand that this failure mode (silently ignoring SCSI commands) is truly a case of broken-by-design hardware. If a disk doesn't honor these commands, then no synchronous operation is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC. This means you can't use such disks for (say) a database or NFS server, because it is *impossible* to know when the data is on stable storage.

If it were possible to detect such disks, I'd add code to ZFS that would simply refuse to use them. Unfortunately, there is no reliable way to test the functioning of synchronize-cache programmatically.

Jeff
On 10-Feb-09, at 7:41 PM, Jeff Bonwick wrote:

>> well....if you want a write barrier, you can issue a flush-cache and
>> wait for a reply before releasing writes behind the barrier. You
>> will get what you want by doing this for certain.
>
> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.
> That's the issue that people are asking ZFS to work around.
>
> But it's important to understand that this failure mode (silently
> ignoring SCSI commands) is truly a case of broken-by-design hardware.
> If a disk doesn't honor these commands, then no synchronous operation
> is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC.
> This means you can't use such disks for (say) a database or NFS server,
> because it is *impossible* to know when the data is on stable storage.

This applies equally to virtual disks, of course (can we get VirtualBox to NOT ignore flushes by default?)

--Toby

> If it were possible to detect such disks, I'd add code to ZFS that
> would simply refuse to use them. Unfortunately, there is no reliable
> way to test the functioning of synchronize-cache programmatically.
>
> Jeff
>>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes: >>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:jb> Not if the disk drive just *ignores* barrier and flush-cache jb> commands and returns success. Some consumer drives really do jb> exactly that. That''s the issue that people are asking ZFS to jb> work around. Some are asking ZFS to work around the issue, which I think is not crazy: ZFS is already designed around failures clustered together in space, so why not failures clustered together in time as well? But I''m not in their camp, not asking for that workaround. It couldn''t ever deliver the kind if integrity to which the checksum tree aspires. I''m asking for a solution to the overall problem, mostly outing, avoiding, fixing the broken devices and storage stacks. jb> If it were possible to detect such disks, I''d add code to ZFS jb> that would simply refuse to use them. Unfortunately, there is jb> no reliable way to test the functioning of synchonize-cache jb> programmatically. I think the situation''s closer to: there''s no way to test for it upon adding/attaching/replacing a device, so quickly that the user doesn''t realize it''s happening, and with few enough false positives that you don''t mind supporting it when it goes wrong, and don''t mind defending its correctness when it damages vendor relationships. However I think developing a qualification _procedure_ that sysadmins can actually follow, possibly involving cord-yanking, and one that''s decisive enough we can start sharing results instead of saying ``a major vendor'''' and covering our asses all the time, is quite within reach. And I think it''s all but certain to uncover all sorts of problems which are not in devices, too. tt> This applies equally to virtual disks, of course (can we get tt> VirtualBox to NOT ignore flushes by default?) haha but then people would say it performs so much worse than VMWare! :) To be honest I have not absolutely verified this problem. I just hazily remember reading an email here or a bug report about it. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/50ddd61f/attachment-0005.bin>
On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.

ouch.

> If it were possible to detect such disks, I'd add code to ZFS that
> would simply refuse to use them. Unfortunately, there is no reliable
> way to test the functioning of synchronize-cache programmatically.

How about a database of known bad drives? Like the format.dat of old.

-frank
On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:

> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
> <Jeff.Bonwick at sun.com> wrote:
>> Not if the disk drive just *ignores* barrier and flush-cache commands
>> and returns success. Some consumer drives really do exactly that.
>
> ouch.
>
>> If it were possible to detect such disks, I'd add code to ZFS that
>> would simply refuse to use them. Unfortunately, there is no reliable
>> way to test the functioning of synchronize-cache programmatically.
>
> How about a database of known bad drives? Like the format.dat of old.
>
> -frank

The intransigence of disk makers is incredible. Name and shame might work, though.

--Toby
Toby Thain wrote:

> On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:
>> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
>> <Jeff.Bonwick at sun.com> wrote:
>>> Not if the disk drive just *ignores* barrier and flush-cache commands
>>> and returns success. Some consumer drives really do exactly that.
>>
>> ouch.
>>
>>> If it were possible to detect such disks, I'd add code to ZFS that
>>> would simply refuse to use them. Unfortunately, there is no reliable
>>> way to test the functioning of synchronize-cache programmatically.
>>
>> How about a database of known bad drives? Like the format.dat of old.
>
> The intransigence of disk makers is incredible. Name and shame might
> work, though.

I do like the idea of a 'known bad' DB: just a quick reference for people to check, and a place to drop an email to $vendor indicating someone's added $drive to the list based on $test? It's a lot of work to keep updated, though. :-/

JB> because it is *impossible* to know when the data is on stable storage.

Pardon the ignorance of in-depth drive internals for a moment: would it be possible to time a write of X to the drive, time a write of X to the drive again with a sync, power it off immediately after the sync returns (physically? programmatically?), then power it back on to re-read the data that was just written? If it's there, then the sync didn't lie; otherwise the drive failed the test.

Many BIOSes support powering off the machine on shutdown -- could the same command be issued to hose the drive in this scenario, skipping a 'proper' shutdown procedure? Or would the PSU continue supplying it with power long enough for it to finish writing? I suppose it would have to be a sufficiently large write...

Alternatively, timing the tested writes across various sectors of the disk would give you a good baseline of how long writes take. Would forcing a sync immediately after writes to the same locations give you an indication of whether the sync is doing what it is supposed to? If there's a noticeable (admittedly vague) increase in delay, then we assume the sync worked?
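A rough sketch of that manual power-cut test (my own illustration; the file name and marker format are arbitrary): the writer keeps rewriting a marker at a fixed offset and fsync()s it, printing each marker only after fsync() returns, so whatever was printed last must be on stable storage if the sync was honest. Cut power right after a printed line, then run "check" after reboot and compare by hand; an older marker on disk than the last one printed means the flush was lied about.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char buf[512];
        unsigned long n;
        int fd = open("powercut.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }

        if (argc > 1 && strcmp(argv[1], "check") == 0) {
                /* after reboot: show what actually survived */
                memset(buf, 0, sizeof (buf));
                (void) pread(fd, buf, sizeof (buf) - 1, 0);
                printf("on disk after reboot: %s\n", buf);
                return (0);
        }

        /* writer: cut power right after a "synced:" line appears */
        for (n = 1; ; n++) {
                (void) snprintf(buf, sizeof (buf), "marker %lu at %ld",
                    n, (long)time(NULL));
                if (pwrite(fd, buf, sizeof (buf), 0) != sizeof (buf)) {
                        perror("pwrite");
                        return (1);
                }
                if (fsync(fd) != 0) {
                        perror("fsync");
                        return (1);
                }
                /* printed only after fsync() has returned */
                printf("synced: %s\n", buf);
                (void) fflush(stdout);
                sleep(1);
        }
        /* NOTREACHED */
}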
On Tue, Feb 10 at 16:41, Jeff Bonwick wrote:

> Not if the disk drive just *ignores* barrier and flush-cache commands
> and returns success. Some consumer drives really do exactly that.
> That's the issue that people are asking ZFS to work around.

Can someone please name a specific device (vendor + model + firmware revision) that does this? I see this claim thrown around like fact repeatedly, and yet I've never personally experienced an actual "consumer" device that discarded FLUSH CACHE (EXT) before, and nobody I know can name one that did.

The only exceptions that might "appear" to be ignoring a barrier that I've witnessed are "high fly" writes in rotating drives, where the servo system couldn't detect that the head struck a defect and was deflected too high to write, and devices that don't support the command at all (and thus abort 51/04 attempts to flush the cache).

BTW, funky/busted bridge hardware in external USB devices doesn't count. I'm more interested in major rotating drive vendors... Seagate/Maxtor, WD, Hitachi/IBM, Fujitsu, Toshiba, etc.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
David Dyer-Bennet
2009-Feb-11 15:27 UTC
[zfs-discuss] Does your device honor write barriers?
On Wed, February 11, 2009 02:33, Eric D. Mudama wrote:

> BTW, funky/busted bridge hardware in external USB devices doesn't count.

They do for me; I'm currently using external USB drives for my backup datasets (in the process of converting to use zfs send/recv to get the data there). My normal procedure even involves yanking the USB cables (in theory long after the backup is completed and the pool is exported, but if the script fails/hangs I might well yank the cable in the morning without verifying the results of the overnight script first).

> I'm more interested in major rotating drive vendors... Seagate/Maxtor,
> WD, Hitachi/IBM, Fujitsu, Toshiba, etc.

There is a larger set of disaster modes if the problem is at that level, of course.

I'd really like to see a reasonably cook-book disk qualification procedure that could detect these problems. It might have to involve a timed disconnect of some sort, which might require hot-swap hardware, but that's something one could live with. And if the qualification procedure were widely believed to be good, aggregating results would be useful.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On February 10, 2009 11:53:39 PM -0500 Toby Thain <toby at telegraphics.com.au> wrote:

> On 10-Feb-09, at 10:36 PM, Frank Cusack wrote:
>
>> On February 10, 2009 4:41:35 PM -0800 Jeff Bonwick
>> <Jeff.Bonwick at sun.com> wrote:
>>> Not if the disk drive just *ignores* barrier and flush-cache commands
>>> and returns success. Some consumer drives really do exactly that.
>>
>> ouch.
>>
>>> If it were possible to detect such disks, I'd add code to ZFS that
>>> would simply refuse to use them. Unfortunately, there is no reliable
>>> way to test the functioning of synchronize-cache programmatically.
>>
>> How about a database of known bad drives? Like the format.dat of old.
>
> The intransigence of disk makers is incredible. Name and shame might
> work, though.

I, for one, don't really care about shaming any vendor. I care about not using broken products. The database need not be compiled by Sun, but it should (ideally) be distributed by them (in OpenSolaris) and supported by zfs.

-frank