Chad Mynhier
2005-Dec-29 02:24 UTC
[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID
Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on top of hardware RAID, given that that guarantee is based on the provably-atomic uberblock update, and given that hardware RAID can't guarantee that atomicity?

Thanks,
Chad Mynhier
Eric Schrock
2005-Dec-29 22:34 UTC
[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID
On Wed, Dec 28, 2005 at 09:24:22PM -0500, Chad Mynhier wrote:
> Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
> top of hardware RAID, given that that guarantee is based on the
> provably-atomic uberblock update, and given that hardware RAID can't
> guarantee that atomicity?

In general, ZFS can only guarantee this for well-behaved hardware. In particular, we issue the appropriate write cache flush commands to make sure that we behave well with write caching on. But if you backed your pool with files, you could imagine the filesystem choosing to cache every uberblock update, but letting subsequent re-use of blocks go straight to disk (despite calls to fsync). When you lose power, you've lost your uberblocks and have possibly mucked your data (when you 'roll back' to a previous uberblock). This is why we never encourage users to create pools backed by files instead of devices.

That being said, there is no such window of vulnerability for devices. All that ZFS requires is that after the uberblock update, all data has been written to disk before continuing on. It accomplishes this via the DKIOCFLUSHWRITECACHE ioctl[1] or fsync() call (for files). It doesn't matter if the write fails before or during the uberblock update, and whether that failure is due to disk failure or hardware RAID parity update failure. In either case, when you come back up, we'll fail to validate the uberblock and fall back to the last known good version.

We also do a somewhat complex label dance (see vdev_label.c) that provides additional redundancy and atomicity above and beyond the hardware and/or pool replication. There are really 4 copies of the uberblock on the first disk of the pool (possibly with additional redundancy), written in a two-phase pass that guarantees atomicity. Note that we simulate partial uberblock writes all the time during our testing, and ZFS has never experienced any inconsistencies.

Hope that helps.

- Eric

[1] If a disk has write caching enabled but refuses to acknowledge the flush command, you have a firmware problem.

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
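The "fail to validate the uberblock and fall back to the last known good version" behavior can be illustrated with a small sketch. This is not the ZFS on-disk format: the 16-byte payload, the SHA-256 checksum, and the names make_uberblock, parse_uberblock, best_uberblock, and torn are all assumptions made for illustration. It only shows the general idea of keeping several checksummed copies and choosing the newest one that still validates.

```python
import hashlib
import struct

PAYLOAD_LEN = 16   # toy layout: 8-byte txg + 8-byte root pointer (not ZFS's format)
CHECKSUM_LEN = 32  # SHA-256 digest, used here as a stand-in checksum


def make_uberblock(txg: int, root_ptr: int) -> bytes:
    """Serialize a toy uberblock: payload followed by its checksum."""
    payload = struct.pack("<QQ", txg, root_ptr)
    return payload + hashlib.sha256(payload).digest()


def parse_uberblock(raw: bytes):
    """Return (txg, root_ptr) if the checksum validates, else None."""
    if len(raw) != PAYLOAD_LEN + CHECKSUM_LEN:
        return None
    payload, checksum = raw[:PAYLOAD_LEN], raw[PAYLOAD_LEN:]
    if hashlib.sha256(payload).digest() != checksum:
        return None
    return struct.unpack("<QQ", payload)


def best_uberblock(slots):
    """Pick the valid uberblock with the highest transaction group number."""
    valid = [ub for ub in map(parse_uberblock, slots) if ub is not None]
    return max(valid, key=lambda ub: ub[0]) if valid else None


def torn(new: bytes, bytes_written: int, previous: bytes) -> bytes:
    """Simulate a partial write: only the first bytes_written bytes of new land."""
    return new[:bytes_written] + previous[bytes_written:]


if __name__ == "__main__":
    slots = [make_uberblock(10, 0x1000), make_uberblock(11, 0x2000)]
    # Power fails while txg 12 is overwriting the oldest slot.
    slots[0] = torn(make_uberblock(12, 0x3000), 20, slots[0])
    print(best_uberblock(slots))
```

In the demo, the half-written txg 12 copy fails its checksum, so the selection falls back to txg 11. It makes no difference to the reader of the slots whether the tear came from a failed disk or a failed RAID parity update.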
Casper.Dik at Sun.COM
2005-Dec-31 14:31 UTC
[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID
> Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
> top of hardware RAID, given that that guarantee is based on the
> provably-atomic uberblock update, and given that hardware RAID can't
> guarantee that atomicity?

If the hardware RAID does not have a mechanism which allows it to report "all data written prior to time X has been committed to persistent storage", then I'd maintain that such a hardware RAID is completely useless for any filesystem.

Casper
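Why such a "everything before time X is persistent" mechanism matters can be shown with a toy model. The sketch below is an assumption-laden simulation, not ZFS or any real driver: CachedDisk, commit_ops, and the block names are hypothetical, and the volatile cache is allowed to persist any subset of its pending writes at power loss, which models a device that reorders cached writes. A flush stands in for a cache flush command such as the one discussed in this thread.

```python
from itertools import chain, combinations

UBERBLOCK = "uberblock"  # hypothetical fixed address of the root pointer


class CachedDisk:
    """Toy block device with a volatile write cache (illustrative model only)."""

    def __init__(self):
        # Previously committed state: one data block and an uberblock that references it.
        self.media = {"b0": "old data", UBERBLOCK: {"txg": 1, "refs": ["b0"]}}
        self.cache = {}  # acknowledged writes that are not yet persistent

    def apply(self, op):
        if op[0] == "write":
            _, addr, data = op
            self.cache[addr] = data
        elif op[0] == "flush":
            self.media.update(self.cache)
            self.cache.clear()

    def power_loss(self, survivors):
        """Only the cached writes named in survivors reach the media; the rest vanish."""
        for addr in survivors:
            self.media[addr] = self.cache[addr]
        self.cache.clear()


def commit_ops(txg, blocks, use_barrier):
    """Copy-on-write commit: write new blocks, then publish them via the uberblock."""
    ops = [("write", addr, data) for addr, data in blocks.items()]
    if use_barrier:
        ops.append(("flush",))  # data must be durable before it is referenced
    ops.append(("write", UBERBLOCK, {"txg": txg, "refs": sorted(blocks)}))
    if use_barrier:
        ops.append(("flush",))  # uberblock must be durable before continuing
    return ops


def is_consistent(disk):
    """Every block referenced by the on-media uberblock must itself be on the media."""
    return all(addr in disk.media for addr in disk.media[UBERBLOCK]["refs"])


def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def run(ops):
    disk = CachedDisk()
    for op in ops:
        disk.apply(op)
    return disk


def survives_every_crash(use_barrier):
    """Crash after every prefix of the commit, with every possible set of surviving writes."""
    ops = commit_ops(2, {"b1": "new data"}, use_barrier)
    for k in range(len(ops) + 1):
        for survivors in subsets(run(ops[:k]).cache):
            disk = run(ops[:k])
            disk.power_loss(survivors)
            if not is_consistent(disk):
                return False
    return True


if __name__ == "__main__":
    print("with flushes:   ", survives_every_crash(True))
    print("without flushes:", survives_every_crash(False))
```

With the flushes in place, every simulated crash leaves an uberblock whose referenced blocks are all present. Without them, the model finds a crash in which the new uberblock reaches the media but the data it points to does not.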
Chad Mynhier
2005-Dec-31 18:02 UTC
[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID
Thanks for the response. (And thanks for indulging me in answering a question that I could have answered for myself after about ten minutes of perusing the ZFS on-disk specification.)

Chad Mynhier
Chad Mynhier
2005-Dec-31 18:40 UTC
[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID
On 12/31/05, Casper.Dik at sun.com <Casper.Dik at sun.com> wrote:
> If the hardware RAID does not have a mechanism which allows it
> to report "all data written prior to time X has been committed to
> persistent storage", then I'd maintain that such a hardware RAID
> is completely useless for any filesystem.

I would agree, but my question had more to do with the fact that uberblock updates are done in place rather than being COW. My at-the-time naive understanding was that the atomicity of the uberblock update was guaranteed by the fact that it was a single disk block. Were that the case, LVMs and hardware RAID would break that guarantee, because the uberblock would necessarily be part of a stripe, and a stripe write cannot be atomic.

But as Eric and the ZFS on-disk specification point out, there's enough redundancy to guarantee that there's always a good copy of a recent uberblock, assuming the hardware's just not outright bad.

Thanks,
Chad Mynhier
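The redundancy argument can be checked by brute force in a small model. The sketch below assumes four copies that are updated in place, with a flush between phases, and lets each copy being written at the moment of a crash end up old, new, or torn. The split of copies into phases (0 and 2, then 1 and 3) is an assumption for illustration; vdev_label.c is the authority on the actual sequence. The names crash_states, recoverable, and always_recoverable are hypothetical.

```python
from itertools import product

OLD, NEW, TORN = "old", "new", "torn"


def crash_states(phases, n_copies=4):
    """Yield every possible on-disk state if power fails during an in-place update.

    phases is a list of tuples of copy indices. A flush separates phases, so
    copies from completed phases are fully NEW and copies from later phases are
    still OLD, while each copy in the interrupted phase may be in any state.
    """
    for i, in_flight in enumerate(phases):
        done = {c for phase in phases[:i] for c in phase}
        for outcome in product((OLD, NEW, TORN), repeat=len(in_flight)):
            state = [NEW if c in done else OLD for c in range(n_copies)]
            for copy, result in zip(in_flight, outcome):
                state[copy] = result
            yield tuple(state)


def recoverable(state):
    """The pool can be opened if at least one copy passes validation."""
    return any(copy != TORN for copy in state)


def always_recoverable(phases):
    return all(recoverable(state) for state in crash_states(phases))


TWO_PHASE = [(0, 2), (1, 3)]  # assumed split: half the copies per phase
ONE_PHASE = [(0, 1, 2, 3)]    # naive alternative: overwrite every copy at once

if __name__ == "__main__":
    print("two-phase always recoverable:", always_recoverable(TWO_PHASE))
    print("one-phase always recoverable:", always_recoverable(ONE_PHASE))
```

In this model the two-phase schedule always leaves at least one intact copy, because the copies outside the interrupted phase are never in flight. Overwriting all four at once admits a crash in which every copy is torn. So a non-atomic stripe write underneath does not break the guarantee, as long as flushes are honored between phases.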