Hi,

I've been trying to understand how transactional writes work in
RAID-Z.  I think I understand the ZFS system for transactional writes
in general (the only place I could find that info was Wikipedia;
someone should fix that!).  For RAID-Z it seems to me that the only
way to make it transactional is to have a single tree of blocks
spanning all disks, so that, e.g., the blocks in a stripe across three
disks will all share a single parent block.  Is that how it works?

Oh, also, doesn't the model (non-RAID, even) imply that every write
generates at least log(N) writes?  (The parent node changes because it
stores child pointers, and the change propagates up the tree.)  I
understand that transaction groups would limit that effect, of course.

Thanks,

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
Hello David,

Tuesday, July 11, 2006, 8:34:10 PM, you wrote:

DA> For RAID-Z it seems to me that the only way to make it
DA> transactional is to have a single tree of blocks spanning all
DA> disks, so that, e.g., the blocks in a stripe across three disks
DA> will all share a single parent block.  Is that how it works?

With RAID-Z, one FS block is spread onto N-1 disks plus one parity
disk.  So you have one FS block which is divided into N-1 smaller
"device" blocks.  It's a clever trick with its own pros and cons.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
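Robert's split can be sketched in a few lines.  This is a toy
illustration under simplifying assumptions (fixed-width chunks, single
XOR parity; the function names are invented, and real RAID-Z uses
variable stripe width and dynamic parity placement, not this layout):

```python
# Toy sketch, NOT the real ZFS on-disk format: one logical FS block is
# divided into n_disks-1 "device" chunks plus one XOR parity chunk.

def split_with_parity(fs_block: bytes, n_disks: int) -> list[bytes]:
    """Divide fs_block across n_disks-1 data chunks and append XOR parity."""
    n_data = n_disks - 1
    chunk_len = -(-len(fs_block) // n_data)              # ceiling division
    padded = fs_block.ljust(n_data * chunk_len, b"\0")   # pad the tail chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    parity = bytearray(chunk_len)
    for chunk in chunks:                                 # parity = XOR of all data
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks + [bytes(parity)]

def rebuild(pieces: list, lost: int) -> bytes:
    """Recover one missing piece (data or parity) by XOR-ing the survivors."""
    length = next(len(p) for j, p in enumerate(pieces) if j != lost)
    out = bytearray(length)
    for j, piece in enumerate(pieces):
        if j == lost:
            continue
        for i, byte in enumerate(piece):
            out[i] ^= byte
    return bytes(out)

pieces = split_with_parity(b"hello raid-z world", 3)   # 2 data chunks + parity
saved = pieces[0]
pieces[0] = None                                       # one disk dies
assert rebuild(pieces, 0) == saved                     # XOR recovers the chunk
```

The same XOR trick recovers the parity chunk itself if the parity disk
is the one that fails.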
Robert Milkowski <rmilkowski at task.gda.pl> writes:

> With RAID-Z, one FS block is spread onto N-1 disks plus one parity
> disk.  So you have one FS block which is divided into N-1 smaller
> "device" blocks.  It's a clever trick with its own pros and cons.

It's not unfamiliar to me; that's standard RAID5-ish striping, AFAICT.
But you're not answering my question:

  How can RAID-Z preserve transactional semantics when a single
  FS block write requires writing to multiple physical devices?

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
On Tue, Jul 11, 2006 at 11:03:17PM -0400, David Abrahams wrote:

> How can RAID-Z preserve transactional semantics when a single
> FS block write requires writing to multiple physical devices?

ZFS uses a technique that's been used in databases for years: phase
trees.  First you write all the subtrees you're updating to disk (to
currently free space; this is the COW part), wait for them to sync,
then update the tree's root (the uberblock) in a two-phase commit.  It
doesn't matter whether you're writing to multiple independent disks or
to multiple disks in a RAID-Z stripe: the individual writes don't need
to be atomic, just the update to the root of the tree.

The other trick is that with RAID-Z, every logical filesystem block
(512B - 128KB) is its own stripe with its own parity.  So by writing a
new block, you're not disturbing the parity of any old blocks.

See Jeff Bonwick's blog post on RAID-Z to learn more:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

--Bill
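Bill's ordering (write new subtrees to free space first, publish the
root last) can be sketched as follows.  This is a hypothetical model,
not ZFS code; the fake block store and all names are invented for
illustration.  It also shows the log(N) write amplification David
asked about: rewriting one leaf rewrites every ancestor up to the
root:

```python
# Sketch of a copy-on-write tree update (assumed/simplified model, not
# ZFS code).  New versions of a leaf and its ancestors go to fresh
# "disk" addresses; the old tree stays fully valid until the root
# pointer is published, which is what makes the update transactional.

disk = {}           # fake block store: address -> payload
next_addr = 0

def write_block(payload):
    """Allocate a fresh address (never overwrite in place) and store payload."""
    global next_addr
    addr = next_addr
    next_addr += 1
    disk[addr] = payload
    return addr

def cow_update(root_addr, path, new_data):
    """Rewrite the leaf at `path` (a list of child indices from the root
    down), returning the address of a brand-new root."""
    if not path:
        return write_block(new_data)        # new leaf, fresh location
    children = list(disk[root_addr])        # interior block = tuple of child addrs
    children[path[0]] = cow_update(children[path[0]], path[1:], new_data)
    return write_block(tuple(children))     # parent rewritten too: log(N) writes

# Build a tiny two-level tree: root -> two leaves.
leaf_a = write_block(b"old A")
leaf_b = write_block(b"old B")
root = write_block((leaf_a, leaf_b))

new_root = cow_update(root, [0], b"new A")
# Both trees now coexist on "disk"; committing is just publishing new_root.
assert disk[disk[root][0]] == b"old A"      # old tree is untouched
assert disk[disk[new_root][0]] == b"new A"  # new tree is complete
```

A crash before the root pointer is published simply leaves the old
tree in place, with the new blocks sitting in what is still free
space.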
> But you're not answering my question:
>
>   How can RAID-Z preserve transactional semantics when a single
>   FS block write requires writing to multiple physical devices?

Since transactions in ZFS aren't committed until the ueberblock is
written, this boils down to:

  "How is the ueberblock committed atomically in a RAID-Z
  configuration?"

Correct?

Casper
> Since transactions in ZFS aren't committed until the ueberblock is
> written, this boils down to:
>
>   "How is the ueberblock committed atomically in a RAID-Z
>   configuration?"

RAID-Z isn't even necessary to have this issue; all you need is a disk
that doesn't actually guarantee atomicity of single-sector writes.
Which, of course, we have to cope with.

The key is that there's actually a ring of 128 uberblocks, indexed by
transaction group number (mod 128).  When we open a storage pool, we
read every uberblock; among those that have a valid SHA-256 checksum,
we take the one with the highest transaction group (txg).  That is, by
definition, the uberblock for the last txg that successfully committed
to disk.

If we lose power in the middle of writing an uberblock, that uberblock
won't checksum, so we'll use the one from the previous txg, i.e. the
last one that synced completely.

Jeff
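Jeff's ring can be sketched like this.  The record layout and every
name here are invented; real uberblocks live in the vdev labels with
many more fields, but the select-the-highest-valid-txg logic is the
same idea:

```python
# Sketch (assumed layout, heavily simplified): a ring of 128 uberblock
# slots, each self-checksummed with SHA-256.  On pool open, scan all
# slots and take the valid one with the highest txg; a torn write just
# fails its checksum and loses to the previous txg's slot.
import hashlib
import struct

RING = 128

def make_uberblock(txg: int, root_ptr: int) -> bytes:
    body = struct.pack("<QQ", txg, root_ptr)       # txg + root block pointer
    return body + hashlib.sha256(body).digest()    # self-checksummed record

def write_uberblock(ring: list, txg: int, root_ptr: int) -> None:
    ring[txg % RING] = make_uberblock(txg, root_ptr)

def active_uberblock(ring: list):
    """Return (txg, root_ptr) of the newest slot whose checksum verifies."""
    best = None
    for slot in ring:
        if slot is None:
            continue
        body, digest = slot[:16], slot[16:]
        if hashlib.sha256(body).digest() != digest:
            continue                               # torn or corrupt: skip it
        txg, root_ptr = struct.unpack("<QQ", body)
        if best is None or txg > best[0]:
            best = (txg, root_ptr)
    return best

ring = [None] * RING
for txg in range(1, 6):
    write_uberblock(ring, txg, root_ptr=100 + txg)

# Simulate a torn write of txg 5: flip one bit of its stored record.
ring[5] = ring[5][:-1] + bytes([ring[5][-1] ^ 1])
assert active_uberblock(ring) == (4, 104)          # falls back to last full txg
```

Note how the never-written slots (`None` here, birth-time-zero blocks
in the real thing) are simply skipped by the scan.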
> The key is that there's actually a ring of 128 uberblocks, indexed
> by transaction group number (mod 128).  When we open a storage pool,
> we read every uberblock; among those that have a valid SHA-256
> checksum, we take the one with the highest transaction group (txg).
> That is, by definition, the uberblock for the last txg that
> successfully committed to disk.
>
> If we lose power in the middle of writing an uberblock, that
> uberblock won't checksum, so we'll use the one from the previous
> txg, i.e. the last one that synced completely.

Thanks for providing this last bit of my mental ZFS picture.

Does ZFS keep statistics on how many ueberblocks are bad when it
imports a pool?  Or is it the case that when fewer than 128
ueberblocks have ever been committed, the remainder will be bogus?

Casper
> Thanks for providing this last bit of my mental ZFS picture.
>
> Does ZFS keep statistics on how many ueberblocks are bad when
> it imports a pool?

No.  We could, of course, but I'm not sure how it would be useful.

> Or is it the case that when fewer than 128 ueberblocks have ever
> been committed, the remainder will be bogus?

Right.  We actually initialize them all during pool creation, so that
they're not just random garbage; but they all have birth times of
zero, indicating that they're not valid.

Jeff
>> Does ZFS keep statistics on how many ueberblocks are bad when
>> it imports a pool?
>
> No.  We could, of course, but I'm not sure how it would be useful.

It would probably be one of those "nothing is wrong" call generators
induced by strange error messages and bogus statistics.

>> Or is it the case that when fewer than 128 ueberblocks have ever
>> been committed, the remainder will be bogus?
>
> Right.  We actually initialize them all during pool creation, so
> that they're not just random garbage; but they all have birth times
> of zero, indicating that they're not valid.

You'd probably have to initialize them anyway, for fear of them being
valid leftovers from old pools.

Casper
Casper.Dik at Sun.COM writes:

>> But you're not answering my question:
>>
>>   How can RAID-Z preserve transactional semantics when a single
>>   FS block write requires writing to multiple physical devices?
>
> Since transactions in ZFS aren't committed until the ueberblock is
> written, this boils down to:
>
>   "How is the ueberblock committed atomically in a RAID-Z
>   configuration?"
>
> Correct?

Correct.  And thanks to Jeff for the cogent answer.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
Jeff Bonwick <bonwick at zion.eng.sun.com> writes:

> The key is that there's actually a ring of 128 uberblocks, indexed
> by transaction group number (mod 128).  When we open a storage pool,
> we read every uberblock; among those that have a valid SHA-256
> checksum, we take the one with the highest transaction group (txg).
> That is, by definition, the uberblock for the last txg that
> successfully committed to disk.
>
> If we lose power in the middle of writing an uberblock, that
> uberblock won't checksum, so we'll use the one from the previous
> txg, i.e. the last one that synced completely.

I *love* information!  I'm not a DB or filesystem engineer, so much of
this is not stuff I work with every day, and I've always wondered how
people got around issues like power failures and other uncontrollable
shutdowns really reliably and cleanly.  This is a way I haven't read
about before, and it makes perfect sense and seems fairly cheap (you
only have to look at all 128 at startup).

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://dd-b.lighthunters.net/>
<http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>