thr3ads.net - zfs discuss - [zfs-discuss] ZFS Questions. (RAID-Z questions actually) [Jul 2006]


Steven Sim

2006-Jul-03 15:13 UTC

[zfs-discuss] ZFS Questions. (RAID-Z questions actually)

Hello Gurus;

I've been playing with ZFS and reading the materials, blogs and FAQs.

It's an awesome FS and I just wish that Sun would evangelize a little 
bit more. But that's another story.

I''m writing here to ask a few very simple questions.

I am able to understand the RAID-5 write hole and its implications.

I am, however, not able to grasp the concept of RAID-Z. More specifically, 
the following statements, which are repeated over and over again across 
many blogs, FAQs and reading materials...

From Jeff Bonwick's weblog 
(http://blogs.sun.com/roller/page/bonwick/20051118)

"RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe 
width. Every block is its own RAID-Z stripe, regardless of blocksize. 
This means that every RAID-Z write is a full-stripe write. This, when 
combined with the copy-on-write transactional semantics of ZFS, 
completely eliminates the RAID write hole. RAID-Z is also faster than 
traditional RAID because it never has to do read-modify-write."

I understand the copy-on-write thing. That was very well illustrated in 
"ZFS The Last Word in File Systems" by Jeff Bonwick.

But if every block is its own RAID-Z stripe, if the block is lost, how
does ZFS recover the block???

Is the stripe parity (as opposed to the block checksum, which I understand) 
stored somewhere else or within the same block????

But how exactly does "every RAID-Z write is a full stripe write" work?
More specifically, in a 3 disk RAID-Z configuration, if one disk 
fails completely and is replaced, exactly how does the "metadata driven 
reconstruction" recover the newly replaced disk?

It goes on...(and very similar statements from other sites and materials..)

"....Well, the tricky bit here is RAID-Z reconstruction. Because the 
stripes are all different sizes, there's no simple formula like "all the
disks XOR to zero." You have to traverse the filesystem metadata to 
determine the RAID-Z geometry. Note that this would be impossible if the 
filesystem and the RAID array were separate products, which is why 
there's nothing like RAID-Z in the storage market today. You really need
an integrated view of the logical and physical structure of the data to 
pull it off."

Every stripe is a different size? Is this because ZFS adapts to the nature 
of the I/O coming to it?

Could someone elaborate more on the statement "metadata drives 
reconstruction"...

(I am familiar with metadata. More specifically, I am familiar with UFS 
and its methodology. But I am having a little difficulty with the above 
statement....)

The following is from zfs admin 0525..

"In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are 
full-stripe writes. This design is only possible because ZFS integrates 
file system and device management in such a way that the file system's 
metadata has enough information about the underlying data replication 
model to handle variable-width RAID stripes."

I could use a little help here...

I apologize if these questions are elementary ones....

Warmest Regards
Steven Sim




Fujitsu Asia Pte. Ltd.

Casper.Dik at Sun.COM

2006-Jul-03 15:26 UTC



>I understand the copy-on-write thing. That was very well illustrated in 
>"ZFS The Last Word in File Systems" by Jeff Bonwick.
>
>But if every block is its own RAID-Z stripe, if the block is lost, how
>does ZFS recover the block???

You should perhaps not take "block" literally; the block is written as
part of a single transaction on all disks of the RAID-Z group.

Only once the block is stored on disk are the bits that reference it
written.  For the whole block to be lost, all disks need to be lost
or the transaction must not occur.
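
The ordering described above (data first, reference last) can be sketched in Python. This is only a toy model under stated assumptions: ToyPool, its dictionary "disk" and the crash_before_commit flag are hypothetical names invented for illustration and are not ZFS code or ZFS interfaces.

```python
class ToyPool:
    """Hypothetical copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.disk = {}        # address -> bytes, stands in for the disks
        self.next_addr = 0
        self.root = None      # address of the currently referenced block

    def _allocate(self, data: bytes) -> int:
        # Always write to a fresh address, never over live data.
        addr = self.next_addr
        self.next_addr += 1
        self.disk[addr] = data
        return addr

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        addr = self._allocate(data)   # step 1: the new block goes to disk
        if crash_before_commit:
            return                    # simulated crash: reference never updated
        self.root = addr              # step 2: the reference is switched last

    def read(self):
        return None if self.root is None else self.disk[self.root]


pool = ToyPool()
pool.write(b"version 1")
pool.write(b"version 2", crash_before_commit=True)
assert pool.read() == b"version 1"   # the old block is still the live one
```

In this sketch an interrupted write leaves the previous block untouched and still referenced, which is the property the reply relies on: either the whole transaction happened or none of it did.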
>Is the stripe parity (as opposed to block checksum which I understand) 
>stored somewhere else or within the same block????

Parts of the block are written to each disk; the parity is written to
the parity disk.

>But how exactly does "every RAID-Z write is a full stripe write" work?
>More specifically, if in a 3 disk RAID-Z configuration, if one disk 
>fails completely and is replaced, exactly how does the "metadata driven
>reconstruction" recover the newly replaced disk?

The metadata-driven reconstruction will take the uberblock and from there
it will re-read the other disks and reconstruct the parity while also
verifying checksums.

Not all data needs to be read and not all parity needs to be computed;
only the bits of disks which are actually in use are verified and have
their parity recomputed.

>"....Well, the tricky bit here is RAID-Z reconstruction. Because the 
>stripes are all different sizes, there's no simple formula like "all the
>disks XOR to zero." You have to traverse the filesystem metadata to 
>determine the RAID-Z geometry. Note that this would be impossible if the 
>filesystem and the RAID array were separate products, which is why 
>there's nothing like RAID-Z in the storage market today. You really need
>an integrated view of the logical and physical structure of the data to 
>pull it off."
>
>Every stripe is a different size? Is this because ZFS adapts to the nature 
>of the I/O coming to it?

It's because the blocks written are all of different sizes.



So if you write a 128K block on a 3 way RAID-Z, this can be written as
2x64K of data + 1x64K of parity.

(Though I must admit that in such a scheme the disks still XOR to zero, at 
least the bits of disk used)
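
The 2 x 64K data + 1 x 64K parity layout above can be sketched in Python. This is a simplified illustration, not ZFS code: it assumes single parity computed as a plain XOR of equal-sized data columns, ignores sector alignment, and the helper names make_stripe and rebuild_column are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(block: bytes, ndisks: int) -> list:
    """Split one block into (ndisks - 1) data columns plus one parity column."""
    ndata = ndisks - 1
    col_len = -(-len(block) // ndata)             # ceiling division
    padded = block.ljust(col_len * ndata, b"\0")  # zero-pad the last column
    data = [padded[i * col_len:(i + 1) * col_len] for i in range(ndata)]
    parity = reduce(xor_bytes, data)
    return data + [parity]


def rebuild_column(stripe: list, lost: int) -> bytes:
    """Recover the column at index `lost` by XOR-ing the surviving columns."""
    survivors = [col for i, col in enumerate(stripe) if i != lost]
    return reduce(xor_bytes, survivors)


block = bytes(range(256)) * 512           # a 128K block
stripe = make_stripe(block, ndisks=3)     # 2 x 64K data + 1 x 64K parity
assert [len(col) for col in stripe] == [65536, 65536, 65536]
assert rebuild_column(stripe, lost=0) == stripe[0]
```

Because the stripe width follows the block size, a smaller block passed to make_stripe yields shorter columns, so reconstruction needs to know where each block starts and how long it is. That information lives in the filesystem metadata rather than in a fixed stripe geometry.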

Casper

Nicolas Williams

2006-Jul-04 06:50 UTC



On Mon, Jul 03, 2006 at 11:13:33PM +0800, Steven Sim wrote:
> Could someone elaborate more on the statement "metadata drives 
> reconstruction"...

ZFS starts from the uberblock and works its way down the metadata (think
recursive tree traversal) to find all live blocks and rebuilds the
replaced vdev's contents accordingly.  No need to rebuild unused
blocks.
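
The traversal described above can be sketched as a walk over a tree of block pointers. This is a toy model under stated assumptions: the Block class, live_blocks and resilver are hypothetical names for illustration, and real ZFS block pointers carry far more (checksums, device offsets, sizes) than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical block: a name plus pointers to child blocks."""
    name: str
    children: list = field(default_factory=list)


def live_blocks(root: Block):
    """Yield every block reachable from the root, each exactly once."""
    seen = set()
    stack = [root]
    while stack:
        blk = stack.pop()
        if id(blk) in seen:
            continue
        seen.add(id(blk))
        yield blk
        stack.extend(blk.children)


def resilver(root: Block, rebuild) -> int:
    """Call rebuild(block) for every live block and return how many were rebuilt."""
    count = 0
    for blk in live_blocks(root):
        rebuild(blk)
        count += 1
    return count


data1 = Block("data1")
data2 = Block("data2")
indirect = Block("indirect", [data1, data2])
uber = Block("uberblock", [indirect])
freed = Block("freed")        # not referenced from the root, so never visited

rebuilt = []
resilver(uber, lambda blk: rebuilt.append(blk.name))
assert sorted(rebuilt) == ["data1", "data2", "indirect", "uberblock"]
```

Only blocks reachable from the root are touched, which is why unused space on the replaced disk does not need to be rebuilt.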
