Brad
2010-Jan-12 10:53 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
Has anyone worked with an x4500/x4540 and know if the internal RAID controllers have a BBU? I'm concerned that we won't be able to turn off the write cache on the internal HDs and SSDs to prevent data corruption in case of a power failure.
Toby Thain
2010-Jan-12 15:39 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
On 12-Jan-10, at 5:53 AM, Brad wrote:
> Has anyone worked with an x4500/x4540 and know if the internal RAID
> controllers have a BBU? I'm concerned that we won't be able to turn
> off the write cache on the internal HDs and SSDs to prevent data
> corruption in case of a power failure.

A power failure won't corrupt data even with the write cache enabled, under the assumptions about device behaviour recently mentioned on the list. (Caching isn't the problem; ordering is.) The Sun machines must be tested and qualified for correct behaviour.

--Toby
Richard Elling
2010-Jan-12 16:13 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
On Jan 12, 2010, at 2:53 AM, Brad wrote:
> Has anyone worked with an x4500/x4540 and know if the internal RAID
> controllers have a BBU? I'm concerned that we won't be able to turn
> off the write cache on the internal HDs and SSDs to prevent data
> corruption in case of a power failure.

Yes, the write cache is enabled by default, depending on the pool configuration. This is true for all disks that support write caches. The key to making this work is whether the device honors the cache flush command. The disks qualified for the X4500/X4540 will honor the cache flush command.
 -- richard
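For concreteness, here is a minimal sketch of issuing that cache flush by hand, assuming the Solaris/illumos DKIOCFLUSHWRITECACHE ioctl from <sys/dkio.h>. The device path is hypothetical, running it requires privileges, and it is an illustration rather than anything ZFS itself ships.

/*
 * Minimal sketch, not from the thread: send the same synchronous
 * "flush write cache" request that ZFS issues at transaction-group
 * commit. Assumes Solaris/illumos <sys/dkio.h>; the device path is
 * hypothetical and the program needs sufficient privileges.
 */
#include <sys/dkio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>    /* ioctl() on Solaris/illumos */
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/rdsk/c0t0d0s0";  /* hypothetical disk */
        int fd = open(dev, O_RDWR);

        if (fd < 0) {
                perror("open");
                return (1);
        }

        /*
         * A NULL argument asks for a synchronous flush: the ioctl does
         * not return until the drive reports its volatile cache is on
         * stable media. A drive that "honors the cache flush command"
         * is one that actually does this rather than acknowledging the
         * request and ignoring it.
         */
        if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) != 0)
                perror("DKIOCFLUSHWRITECACHE");
        else
                (void) printf("write cache flushed on %s\n", dev);

        (void) close(fd);
        return (0);
}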
Brad
2010-Jan-13 03:40 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
"(Caching isn''t the problem; ordering is.)" Weird I was reading about a problem where using SSDs (intel x25-e) if the power goes out and the data in cache is not flushed, you would have loss of data. Could you elaborate on "ordering"? -- This message posted from opensolaris.org
Brad
2010-Jan-13 03:46 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
Richard,

"Yes, write cache is enabled by default, depending on the pool configuration."

Is it enabled for a striped (mirrored configuration) zpool? I'm asking because of a concern I've read on this forum about a problem with SSDs (and disks) where, if a power outage occurs, any data in the cache would be lost if it hasn't been flushed to disk.
Toby Thain
2010-Jan-13 04:56 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
On 12-Jan-10, at 10:40 PM, Brad wrote:
> "(Caching isn't the problem; ordering is.)"
>
> Weird, I was reading about a problem where, with SSDs (Intel X25-E),
> if the power goes out and the data in the cache is not flushed, you
> would lose data.
>
> Could you elaborate on "ordering"?

ZFS integrity is maintained if the device correctly respects flush/barrier semantics, which, as required, enforce an ordering of operations. The synchronous completion of a flush guarantees that prior writes have durably completed. This is irrespective of write caching.

When a device does not properly flush, all bets are off, because in-flight data (including unwritten data in the write cache) is not written in any determinate manner (you cannot know what was written, or in what order). The precondition for an atomic überblock update is that the tree of blocks it references has been fully written.

This has been mentioned periodically on the list. I thought somebody (Richard Elling?) did a nice capsule summary recently but I can't find it, so here are some other past list snippets by more knowledgeable people than I.

Neil Perrin, 6 Dec, 2009:
> ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
> Transactions enter in Open. Quiescing is where a new Open stage has
> started and waits for transactions that have yet to commit to finish.
> Syncing is where all the completed transactions are pushed to the pool
> in an atomic manner with the last write being the root of the new tree
> of blocks (uberblock).
>
> All the guarantees assume good hardware. As part of the new uberblock
> update we flush the write caches of the pool devices. If this is
> broken all bets are off.

14 Oct, 2009, James R. Van Artsdalen:
> ZFS is different because it uses a different "superblock" every few
> seconds (every transaction commit), and more importantly, the top
> levels of the filesystem and some pool metadata are moved too. After
> every tx commit the uberblock is in a different place and some of its
> pointers are to different places.
>
> Moreover, blocks that were freed by this process are rapidly
> reclaimed. The uberblock itself is not reclaimed for another 127
> commits - several minutes - but the things it points to are. In other
> words as soon as tx group N is committed, blocks from N-1 that are no
> longer referenced are reclaimed as free space.
>
> What goes wrong when the write fence / cache flush doesn't happen:
>
> As soon as the uberblock for tx group N is written everything from N-1
> that is no longer referenced is marked free for reallocation, and
> these newly-freed blocks often contain part of the top levels of the
> N-1 pool / filesystems and metadata.
>
> If the uberblock for N is _not_ written to media when it was supposed
> to be then ZFS may happily reuse the blocks from N-1 while the
> uberblock for N-1 is still the most recent on media, instead of N as
> ZFS expects. In other words there might be a window where the most
> recent uberblock on disk media (N-1) points to a toplevel directory
> block that is overwritten with unrelated data - disaster.
>
> That window closes once uberblock N hits media. Unfortunately with no
> write fence it might be a long time before that happens. ...

10 Oct, 2009, James Relph quotes Dominic Giampaolo:
> "Last, I do not believe that the crash protection scheme used by ZFS
> can ever work reliably on drives that drop the flush track cache
> request. The only approach that is guaranteed to work is to keep
> enough data in a log that when you remount the drive, you can replay
> more data than the drive could have kept cached."

Nicolas Williams, 13 Feb, 2009:
> Also, note that ignoring barriers is effectively as bad as dropping
> writes if there's any chance that some writes will never hit the disk
> because of, say, power failures. Imagine 100 txgs, but some writes
> from the first txg never hitting the disk because the drive keeps
> them in the cache without flushing them for too long, then you pull
> out the disk, or power fails -- in that case not even fallback to
> older txgs will help you, there'd be nothing that ZFS could do to
> help you.

Peter Schuller, 10 Feb, 2009:
> What's stopping a RAID device from, for example, ACK'ing an I/O
> before it is even in the cache? I have not designed RAID controller
> firmware so I am not sure how likely that is, but I don't see it as
> an impossibility. Disabling flushing because you have battery backed
> nvram implies that your battery-backed nvram guarantees ordering of
> all writes, and that nothing is ever placed in said battery backed
> cache out of order.

Jeff Bonwick, 12 Feb, 2007:
> Even if you disable the intent log, the transactional nature of ZFS
> ensures preservation of event ordering. Note that disk caches don't
> come into it: ZFS builds up a wad of transactions in memory, then
> pushes them out as a transaction group. That entire group will either
> commit or not. ZFS writes all the new data to new locations, then
> flushes all disk write caches, then writes the new uberblock, then
> flushes the caches again. Thus you can lose power at any point in the
> middle of committing transaction group N, and you're guaranteed that
> upon reboot, everything will either be at state N or state N-1.
>
> I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> every system call.

(This issue also arises with respect to the questionable VirtualBox default setting of "Ignore Flush".)

--Toby
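To make the ordering in Jeff Bonwick's description concrete, here is an illustrative sketch, not ZFS source, that simulates the commit sequence against an ordinary file: write the new blocks, flush, write the uberblock, flush again. fsync() stands in for the disk cache-flush command, and the file name, offsets, and record contents are all hypothetical.

/*
 * Illustrative sketch only (not ZFS code): the commit ordering Jeff
 * Bonwick describes above, simulated against an ordinary file. fsync()
 * stands in for the disk cache-flush command; the offsets, "uberblock"
 * record, and file name are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define UBER_OFFSET     0       /* fixed "uberblock" slot               */
#define DATA_OFFSET     4096    /* new blocks always go to fresh space  */

int
main(void)
{
        int fd = open("pool.img", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }

        /* 1. Write the new tree of blocks to previously unused locations. */
        const char data[] = "txg N: new copy-on-write blocks";
        (void) pwrite(fd, data, sizeof (data), DATA_OFFSET);

        /*
         * 2. Flush, so every block the new uberblock will point at is
         *    durably on media before the uberblock itself is written.
         */
        (void) fsync(fd);

        /* 3. Only now write the new uberblock referencing those blocks. */
        const char uber[] = "uberblock -> txg N @ 4096";
        (void) pwrite(fd, uber, sizeof (uber), UBER_OFFSET);

        /*
         * 4. Flush again so the uberblock is durable before any blocks
         *    freed by txg N-1 may be reused. Lose power anywhere in
         *    steps 1-4 and the on-disk state is either entirely txg N
         *    or entirely txg N-1; skip or reorder the flushes and the
         *    window James Van Artsdalen describes opens up.
         */
        (void) fsync(fd);

        (void) close(fd);
        return (0);
}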
Richard Elling
2010-Jan-13 17:00 UTC
[zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
On Jan 12, 2010, at 7:46 PM, Brad wrote:
> Richard,
>
> "Yes, write cache is enabled by default, depending on the pool
> configuration."
>
> Is it enabled for a striped (mirrored configuration) zpool? I'm asking
> because of a concern I've read on this forum about a problem with SSDs
> (and disks) where, if a power outage occurs, any data in the cache
> would be lost if it hasn't been flushed to disk.

If the vdev is a whole disk (for Solaris == not a slice), then ZFS will attempt to enable the write cache. By default, Solaris will not enable the write cache on disks, in part because it causes bad juju for UFS. This is independent of the data protection configuration.
 -- richard
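A minimal sketch of what "write cache enable" looks like at the ioctl level, assuming the Solaris/illumos DKIOCGETWCE/DKIOCSETWCE ioctls from <sys/dkio.h>. The device path is hypothetical, the set step is left commented out, and this is an illustration rather than ZFS's own code, which does the equivalent internally for whole-disk vdevs.

/*
 * Minimal sketch, not from the thread: query (and optionally set) a
 * disk's write-cache-enable state. Assumes the Solaris/illumos
 * DKIOCGETWCE/DKIOCSETWCE ioctls from <sys/dkio.h>; the device path is
 * hypothetical and changing the setting requires privileges.
 */
#include <sys/dkio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>    /* ioctl() on Solaris/illumos */
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/rdsk/c0t1d0s0";  /* hypothetical disk */
        int fd = open(dev, O_RDWR);
        int wce;

        if (fd < 0) {
                perror("open");
                return (1);
        }

        /* Report whether the drive's volatile write cache is enabled. */
        if (ioctl(fd, DKIOCGETWCE, &wce) == 0)
                (void) printf("%s: write cache %s\n", dev,
                    wce ? "enabled" : "disabled");
        else
                perror("DKIOCGETWCE");

        /*
         * To enable the cache (roughly what ZFS attempts on whole-disk
         * vdevs, knowing it will flush at every txg commit):
         *
         *      wce = 1;
         *      (void) ioctl(fd, DKIOCSETWCE, &wce);
         *
         * Left commented out so the sketch is read-only by default.
         */

        (void) close(fd);
        return (0);
}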