thr3ads.net - zfs discuss - [zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID [Dec 2005]

Chad Mynhier

2005-Dec-29 02:24 UTC

[zfs-discuss] ZFS' always-consistent-on-disk guarantee and hardware RAID

Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
top of hardware RAID, given that that guarantee is based on the
provably-atomic uberblock update, and given that hardware RAID can't
guarantee that atomicity?

Thanks,
Chad Mynhier

Eric Schrock

2005-Dec-29 22:34 UTC

On Wed, Dec 28, 2005 at 09:24:22PM -0500, Chad Mynhier wrote:
> Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
> top of hardware RAID, given that that guarantee is based on the
> provably-atomic uberblock update, and given that hardware RAID can't
> guarantee that atomicity?

In general, ZFS can only guarantee this for well-behaved hardware.  In
particular, we issue the appropriate write cache flush commands to make
sure that we behave well with write caching on.  But if you backed your
pool with files, you could imagine the filesystem choosing to cache
every uberblock update, but letting subsequent re-use of blocks go
straight to disk (despite calls to fsync).  When you lose power, you've
lost your uberblocks and have possibly mucked your data (when you 'roll
back' to a previous uberblock).  This is why we never encourage users to
create pools backed by files instead of devices.

That being said, there is no such window of vulnerability for devices.
All that ZFS requires is that after the uberblock update, all data has
been written to disk before continuing on.  It accomplishes this via the
DKIOCFLUSHWRITECACHE ioctl[1] or fsync() call (for files).  It doesn't
matter if the write fails before or during the uberblock update, and
whether that failure is due to disk failure or hardware RAID parity
update failure.  In either case, when you come back up, we'll fail to
validate the uberblock and fall back to the last known good version.
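
As an illustration of the "fail to validate and fall back" behaviour
described above, here is a minimal sketch in Python.  It is not the ZFS
implementation: the slot layout, the field names (txg, root_ptr), the
slot size and the use of SHA-256 are all assumptions made for the
example.  The idea it shows is only that each uberblock carries a
checksum, and that on import the newest slot whose checksum verifies
wins, so a torn write is simply ignored.

```python
import hashlib
import struct
from typing import List, Optional, Tuple

SLOT_SIZE = 64  # hypothetical slot size, not the real on-disk size


def encode_uberblock(txg: int, root_ptr: int) -> bytes:
    """Pack a (txg, root_ptr) pair followed by a checksum of those fields."""
    body = struct.pack("<QQ", txg, root_ptr)       # 16 bytes
    digest = hashlib.sha256(body).digest()         # 32 bytes
    return (body + digest).ljust(SLOT_SIZE, b"\0")


def decode_uberblock(slot: bytes) -> Optional[Tuple[int, int]]:
    """Return (txg, root_ptr) if the checksum verifies, else None."""
    body, digest = slot[:16], slot[16:48]
    if hashlib.sha256(body).digest() != digest:
        return None                                # torn or corrupt slot
    return struct.unpack("<QQ", body)


def best_uberblock(slots: List[bytes]) -> Optional[Tuple[int, int]]:
    """Pick the valid uberblock with the highest txg, if any."""
    valid = [ub for ub in map(decode_uberblock, slots) if ub is not None]
    return max(valid, default=None)                # tuples compare by txg first


# Two good updates, then a write for txg 12 that was interrupted halfway.
slots = [encode_uberblock(10, 0x1000), encode_uberblock(11, 0x2000)]
torn = encode_uberblock(12, 0x3000)[:8] + b"\xff" * (SLOT_SIZE - 8)
slots.append(torn)
chosen = best_uberblock(slots)                     # falls back to txg 11
```

Because the torn slot never passes validation, it does not matter
whether the interruption came from a failing disk or from a RAID
controller that only partially updated a stripe.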

We also do a somewhat complex label dance (see vdev_label.c) that
provides additional redundancy and atomicity above and beyond the
hardware and/or pool replication.  There are really 4 copies of the
uberblock on the first disk of the pool (possibly with additional
redundancy), written in a two-phase pass that guarantees atomicity.
Note that we simulate partial uberblock writes all the time during our
testing, and ZFS has never experienced any inconsistencies.
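
To make the two-phase idea concrete, here is a small Python simulation.
It is a sketch under stated assumptions, not a description of what
vdev_label.c actually does: the helper names (seal, unseal,
two_phase_update), the update order 0, 2, 1, 3 and the way a torn write
is modelled are all hypothetical.  It shows only the general property
that if four checksummed copies are rewritten in stages, a crash at any
single write still leaves intact copies to recover from.

```python
import hashlib
from typing import List, Optional


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unseal(copy: bytes) -> Optional[bytes]:
    """Return the payload if the digest verifies, else None."""
    digest, payload = copy[:32], copy[32:]
    return payload if hashlib.sha256(payload).digest() == digest else None


def two_phase_update(copies: List[bytes], new: bytes,
                     crash_at: Optional[int] = None) -> None:
    """Rewrite four copies in two phases: copies 0 and 2, then 1 and 3.

    crash_at=k simulates power loss during the k-th write (0-based):
    that copy is left half new, half old, and nothing after it is written.
    """
    sealed = seal(new)
    for step, idx in enumerate([0, 2, 1, 3]):      # assumed order
        if step == crash_at:
            half = len(sealed) // 2
            copies[idx] = sealed[:half] + copies[idx][half:]  # torn write
            return
        copies[idx] = sealed
        # A real implementation would flush the write cache between phases.


def intact(copies: List[bytes]) -> List[bytes]:
    """Payloads of all copies that still pass validation."""
    return [p for p in map(unseal, copies) if p is not None]


old, new = b"txg=10", b"txg=11"
results = {}
for crash_at in [0, 1, 2, 3, None]:
    copies = [seal(old)] * 4
    two_phase_update(copies, new, crash_at)
    results[crash_at] = intact(copies)
```

In every case in `results`, at least three copies validate and each one
is either the complete old version or the complete new version, never a
mixture.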

Hope that helps.

- Eric

[1] If a disk has write caching enabled but refuses to acknowledge the
flush command, you have a firmware problem.  

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

Casper.Dik at Sun.COM

2005-Dec-31 14:31 UTC

>Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
>top of hardware RAID, given that that guarantee is based on the
>provably-atomic uberblock update, and given that hardware RAID can't
>guarantee that atomicity?

If the hardware RAID does not have a mechanism which allows it
to report "all data written prior to time X has been committed to
persistent storage", then I'd maintain that such a hardware RAID
is completely useless for any filesystem.

Casper
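
For a pool backed by an ordinary file, the mechanism described above
corresponds to fsync().  The following Python sketch shows the general
"write data, wait for it to be durable, then write the commit record"
pattern.  The function name and file layout are hypothetical, and the
sketch assumes the operating system and device honour fsync, which is
exactly the property under discussion.

```python
import os


def write_then_commit(path: str, data: bytes, commit: bytes) -> None:
    """Write data, force it to stable storage, then write a commit record."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)        # barrier: data must be durable before the commit
        os.write(fd, commit)
        os.fsync(fd)        # barrier: commit must be durable before returning
    finally:
        os.close(fd)
```

If the storage acknowledges the first fsync without actually persisting
the data, the commit record can reach the disk ahead of the data it
refers to, and no filesystem can protect against that.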

Chad Mynhier

2005-Dec-31 18:02 UTC

Thanks for the response.  (And thanks for indulging me in answering a
question that I could have answered for myself after about ten minutes
of perusing the ZFS on-disk specification.)

Chad Mynhier

On 12/29/05, Eric Schrock <eric.schrock at sun.com> wrote:
> On Wed, Dec 28, 2005 at 09:24:22PM -0500, Chad Mynhier wrote:
> > Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
> > top of hardware RAID, given that that guarantee is based on the
> > provably-atomic uberblock update, and given that hardware RAID can't
> > guarantee that atomicity?
>
> In general, ZFS can only guarantee this for well-behaved hardware.  In
> particular, we issue the appropriate write cache flush commands to make
> sure that we behave well with write caching on.  But if you backed your
> pool with files, you could imagine the filesystem choosing to cache
> every uberblock update, but letting subsequent re-use of blocks go
> straight to disk (despite calls to fsync).  When you lose power, you've
> lost your uberblocks and have possibly mucked your data (when you 'roll
> back' to a previous uberblock).  This is why we never encourage users to
> create pools backed by files instead of devices.
>
> That being said, there is no such window of vulnerability for devices.
> All that ZFS requires is that after the uberblock update, all data has
> been written to disk before continuing on.  It accomplishes this via the
> DKIOCFLUSHWRITECACHE ioctl[1] or fsync() call (for files).  It doesn't
> matter if the write fails before or during the uberblock update, and
> whether that failure is due to disk failure or hardware RAID parity
> update failure.  In either case, when you come back up, we'll fail to
> validate the uberblock and fall back to the last known good version.
>
> We also do a somewhat complex label dance (see vdev_label.c) that
> provides additional redundancy and atomicity above and beyond the
> hardware and/or pool replication.  There are really 4 copies of the
> uberblock on the first disk of the pool (possibly with additional
> redundancy), written in a two-phase pass that guarantees atomicity.
> Note that we simulate partial uberblock writes all the time during our
> testing, and ZFS has never experienced any inconsistencies.
>
> Hope that helps.
>
> - Eric
>
> [1] If a disk has write caching enabled but refuses to acknowledge the
> flush command, you have a firmware problem.
>
> --
> Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
>

Chad Mynhier

2005-Dec-31 18:40 UTC

On 12/31/05, Casper.Dik at sun.com <Casper.Dik at sun.com> wrote:
>
> >Does ZFS' always-consistent-on-disk guarantee fail if ZFS is used on
> >top of hardware RAID, given that that guarantee is based on the
> >provably-atomic uberblock update, and given that hardware RAID can't
> >guarantee that atomicity?
>
> If the hardware RAID does not have a mechanism which allows it
> to report "all data written prior to time X has been committed to
> persistent storage", then I'd maintain that such a hardware RAID
> is completely useless for any filesystem.
>
> Casper

I would agree, but my question had more to do with the fact that
uberblock updates are done in place rather than being COW.  My
at-the-time naive understanding was that the atomicity of the
uberblock update was guaranteed by the fact that it was a single disk
block.  Were that the case, LVMs and hardware RAID would break that
guarantee because the uberblock would necessarily be part of a stripe,
and a stripe write cannot be atomic.

But as Eric and the ZFS on-disk specification point out, there's
enough redundancy to guarantee that there's always a good copy of a
recent uberblock, assuming the hardware isn't outright bad.

Thanks,
Chad Mynhier
