Gurus;

I am exceedingly impressed by ZFS, although in my humble opinion Sun is
not doing enough evangelizing for it. But that's beside the point. I am
writing to seek help in understanding the RAID-Z concept.

Jeff Bonwick's weblog states the following:

"RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
width. Every block is its own RAID-Z stripe, regardless of blocksize.
This means that every RAID-Z write is a full-stripe write. This, when
combined with the copy-on-write transactional semantics of ZFS,
completely eliminates the RAID write hole. RAID-Z is also faster than
traditional RAID because it never has to do read-modify-write."

I am unable to relate the above statement to the diagram shown in the
PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE SYSTEMS"
(also by Jeff Bonwick), on page 11.

I was wondering whether Jeff or someone knowledgeable would elaborate
further on the above and also answer the following questions:

* The green and blue "blocks" shown in the diagram on page 11: do they
  represent actual physical blocks on individual disks, or a single
  RAID-Z stripe write across multiple disks?

* The parity for RAID-Z, where is it? Surely not the checksum stored
  together in the upper-level direct and indirect block pointer? And if
  not, and it is written as a separate block on another disk, then...
  I am afraid I do not understand.

* Could someone please elaborate more on the statement "Every block is
  its own RAID-Z stripe"? Is the block being referred to a single block
  across multiple disks, or on a single disk?

My sincere apologies if the above questions seem trivial, but I am
really struggling to reconcile the statement and the diagram.

Warmest Regards
Steven Sim
Fujitsu Asia Pte. Ltd.
> RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
> width. Every block is its own RAID-Z stripe, regardless of blocksize.
> This means that every RAID-Z write is a full-stripe write. This, when
> combined with the copy-on-write transactional semantics of ZFS,
> completely eliminates the RAID write hole. RAID-Z is also faster than
> traditional RAID because it never has to do read-modify-write."
>
> I am unable to relate the above statement to the diagram shown in the
> PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE SYSTEMS"
> (also by Jeff Bonwick), on page 11.

On the copy I'm looking at, page 11 is "Copy-On-Write Transactions".
Note that this page is showing only the "copy on write" stuff (which is
used on all pools) and doesn't show anything about raidz.

> I was wondering whether Jeff or someone knowledgeable would elaborate
> further on the above and also answer the following questions:
>
> * The green and blue "blocks" shown in the diagram on page 11: do they
>   represent actual physical blocks on individual disks or a single
>   RAID-Z stripe write across multiple disks?

They represent filesystem "data" (the leaves) and filesystem "metadata"
(the blocks above the leaves). They're written to a pool that may have
some form of redundancy (mirroring, raidz), but that level is not
presented in this slide.

> * The parity for RAID-Z, where is it?

Mentioned on page 17, but no diagram on this link.

Bill Moore presented this talk to BayLisa several months ago, and used a
very similar presentation, but it had a diagram on the "RAID-Z" slide
(the one on page 17) showing the data and parity blocks used by a raidz
pool. Looking through google, I see many links to similar ZFS
presentations, but none with the diagram on that page.

Ah, found one...
http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf
Page 13 in that stack.

> Surely not the checksum stored together in the upper-level direct and
> indirect block pointer? And if not, and it is written as a separate
> block on another disk, then... I am afraid I do not understand.

The parity is written as a separate block on a separate disk. It's very
similar to how RAID4/RAID5 would write parity on another disk.

It's just that for R4/R5, any given data block on disk can be
immediately calculated to be part of a particular stripe on the storage
(which has a particular parity block). In the case of raidz, the stripe
has a maximum length set by the raidz columns, but it may be smaller
than that.

> * Could someone please elaborate more on the statement "Every block is
>   its own RAID-Z stripe"? Is the block being referred to a single
>   block across multiple disks or a single disk?

Every filesystem block (not disk block). So a single filesystem block
that spans multiple disks.

--
Darren Dunham                                       ddunham at taos.com
Senior Technical Consultant       TAOS          http://www.taos.com/
Got some Dr Pepper?                       San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
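To make Darren's last point concrete ("the stripe has a maximum length
set by the raidz columns, but it may be smaller than that"), here is a
small Python sketch. It is purely my own illustration, not ZFS source:
the 512-byte sector size is an assumption, and the real allocator also
does rounding and padding that this ignores. It just shows how many
columns a single logical block ends up occupying on an N-disk raidz1.

    SECTOR = 512  # assumed sector size

    def raidz_columns(block_size, ndisks, nparity=1):
        """Return (data_columns, parity_columns) used by one logical block."""
        data_sectors = -(-block_size // SECTOR)      # ceiling division
        # A block never spreads over more data columns than ndisks - nparity,
        # but a small block uses fewer columns than the vdev is wide.
        data_cols = min(data_sectors, ndisks - nparity)
        return data_cols, nparity

    # A 2 KB block on a 5-disk raidz1 uses 4 data columns + 1 parity column,
    # while a 512-byte block uses only 1 data column + 1 parity column.
    print(raidz_columns(2048, 5))   # -> (4, 1)
    print(raidz_columns(512, 5))    # -> (1, 1)

So the "stripe" for a given block is exactly as wide as that block needs,
up to the width of the vdev.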
I'll make an attempt to keep it simple, and tell what is true in 'most'
cases. For some values of 'most' ;-)

The words used are at times confusing. "Block" mostly refers to a
logical filesystem block, which can be variable in size. There's also
"checksum" and "parity", which are completely independent.

> * The green and blue "blocks" shown in the diagram on page 11: do they
>   represent actual physical blocks on individual disks or a single
>   RAID-Z stripe write across multiple disks?

See page 17: these are logical blocks, and can be variable in size.

> * The parity for RAID-Z, where is it? Surely not the checksum stored
>   together in the upper-level direct and indirect block pointer? And
>   if not, and it is written as a separate block on another disk,
>   then... I am afraid I do not understand.

RAID-Z parity vs zfs checksum:

The parity is just a chunk of xor-ed data written for redundancy, and is
part of the same I/O transaction. The checksum is a much smaller digest
of the data used for detecting the various modes of data corruption.
This is what goes into the metadata (logical) blocks above. A zfs file
system always has checksums and can function without parity.

> * Could someone please elaborate more on the statement "Every block is
>   its own RAID-Z stripe"? Is the block being referred to a single
>   block across multiple disks or a single disk?

If the storage pool uses an n-way raid-z configuration, the (logical)
block is first split into n-1 chunks, and an nth (parity) chunk is added
before any actual I/O takes place. Each chunk goes to a separate disk.

This goes hand in hand with the answer to question 2. Because it's
Copy-on-Write, we only worry about new data when computing parity.

> My sincere apologies if the above questions seem trivial, but I am
> really struggling to reconcile the statement and the diagram.

Example, on a 4-disk raid-z:

  Logical block (one 6k block of fs data; could be any size <= 128k):

    |00|01|02|03|04|05|06|07|08|09|10|11|  (12 x 512b blocks)  --> ::checksum::

  This is split into a single 4x 2k stripe:

  3 chunks of 2k:
    |00|01|02|03|  --> disk1 (4 sectors)
    |04|05|06|07|  --> disk2 (4 sectors)
    |08|09|10|11|  --> disk3 (4 sectors)
  1 chunk of parity:
    |12|13|14|15|  --> disk4 (4 sectors)

::checksum:: is then recorded in the metadata, which gets written in a
separate stripe. This is recursed for the metadata checksum, until we
reach the ueberblock, for which I won't explain the redundancy and
replication here.

Cheers, Henk
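As a companion to Henk's example, here is a small Python sketch (my own
illustration, not ZFS source) of the same split: one 6 KB logical block
becomes three 2 KB data chunks plus one 2 KB XOR parity chunk on a
4-disk raid-z. The byte-wise XOR is the only real content; the names and
sizes simply restate the example above.

    import os

    def split_with_parity(block, ndisks):
        """Split one logical block into ndisks-1 data chunks + 1 XOR parity chunk."""
        ndata = ndisks - 1
        chunk_len = len(block) // ndata          # 6 KB / 3 = 2 KB per column
        chunks = [block[i * chunk_len:(i + 1) * chunk_len] for i in range(ndata)]
        parity = bytearray(chunk_len)
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b                   # column-wise XOR, as in RAID-5
        return chunks, bytes(parity)

    block = os.urandom(6 * 1024)                 # one 6 KB filesystem block
    data_chunks, parity = split_with_parity(block, ndisks=4)

    # Any single lost data chunk can be rebuilt by XOR-ing the parity with
    # the surviving chunks:
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(parity, data_chunks[1], data_chunks[2]))
    assert rebuilt == data_chunks[0]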
Darren and Henk;

Firstly, thank you very much for both of your replies. I am very
grateful indeed to you both for taking time off to answer my questions.

I understand RAID-5 quite well, and from both of your RAID-Z
descriptions I see that the RAID-Z parity is also a separate block on a
separate disk. Very well. This is just like RAID-5.

My confusion is simple. Would this not then also give rise to the
write-hole vulnerability of RAID-5?

Jeff Bonwick states "that there's no way to update two or more disks
atomically, so RAID stripes can become damaged during a crash or power
outage."

If I understand correctly, the data and parity blocks for RAID-Z are
also written in two separate atomic operations, as per RAID-5 (the only
difference being that each stripe can be of a different size).

How then does it fit into Jeff's statement that "Every block is its own
RAID-Z stripe"? (Perhaps I misunderstood, but I now think this statement
refers to the fact that RAID-Z has a variable stripe size, rather than
meaning that each block holds its own RAID-Z parity within itself.)

If the write-hole power outage situation as described by Jeff Bonwick
occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?
Through each block's independent checksum held one level above in the
metadata block? Is this correct? Or am I completely off course?

Henk Langeveld's wonderful character-based diagrams describe what is
basically a standard RAID-5 layout on 4 disks. How is RAID-Z any
different from RAID-5? (Except for the ability to use stripes of
different sizes, which allows RAID-Z to never have to do a
read-modify-write. This increases performance very significantly, but I
am unable to relate it to the write-hole vulnerability issue.)

Warmest Regards
Steven Sim
Fujitsu Asia Pte. Ltd.
Steven Sim wrote:
> Darren and Henk;
>
> Firstly, thank you very much for both of your replies. I am very
> grateful indeed to you both for taking time off to answer my questions.
>
> I understand RAID-5 quite well, and from both of your RAID-Z
> descriptions I see that the RAID-Z parity is also a separate block on a
> separate disk. Very well. This is just like RAID-5.
>
> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?
>
> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."
>
> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).
>
> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)
>
> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

Recall that no written blocks are actually part of the file system until
all the metadata blocks above them are also written. This includes the
uberblock, whose write is atomic. So although the physical writing of
the blocks on different physical disks is not atomic, if a crash occurs
between such writes it is as if none of the writes ever occurred.

> Through each block's independent checksum held one level above in the
> metadata block? Is this correct? Or am I completely off course?

Yes, that is correct.

HTH,
Fred

--
Fred Zlotnick                      Director, Solaris Data Technology
Sun Microsystems, Inc.             fred.zlotnick at sun.com
                                   x85006/+1 650 786 5006
Hi Steven,

Steven Sim wrote:
> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?
>
> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."
>
> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).

Yes, this is correct, writes to RAID-Z member disks are not atomic. But
for the new block to become part of the filesystem you need to update
all the indirect blocks up to the uber-block. The uber-block is exactly
one sector in size, so it can be written atomically.

> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)

Block here is a logical filesystem block. When it is written to a RAID-Z
vdev of N disks it is split into no more than N-1 parts, and one more
part with parity information is computed from these parts. The parts are
then written to the disks in the RAID-Z vdev, one part per disk.

> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

ZFS never writes new data over old data. New data is always written into
unoccupied space. Regular RAID-5 writes new data over old data, and
since it is possible for an outage to occur between two writes (you
cannot write two sectors to two drives atomically), after that your
stripe will be corrupted and there will be no way to tell what is
correct: the parity or the data sector.

Hope this helps,
victor
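Victor's ordering argument can be compressed into a toy model. The code
below is entirely a sketch of the idea (the class and its methods are
made up for illustration, not ZFS interfaces): every write, data and
metadata alike, lands in previously free space, and nothing becomes live
until the final pointer update, which stands in for the single-sector,
atomic uberblock write.

    class ToyPool:
        """Toy model of COW commit ordering; not ZFS code, just the idea."""
        def __init__(self):
            self.blocks = {}          # address -> payload ("on-disk" contents)
            self.next_free = 0        # trivially simple free-space allocator
            self.uberblock = None     # address of the current root, or None

        def write_to_free_space(self, payload):
            addr = self.next_free     # COW: never reuse a live address
            self.next_free += 1
            self.blocks[addr] = payload
            return addr

        def commit(self, dirty_blocks):
            # 1. Write new data (and, on raidz, its parity) into free space.
            data_addrs = [self.write_to_free_space(b) for b in dirty_blocks]
            # 2. Write a new metadata/indirect block tree that points at them.
            new_root = self.write_to_free_space(("root", tuple(data_addrs)))
            # 3. Only now flip the uberblock. This is the single atomic step:
            #    a crash anywhere before this line leaves the old tree intact.
            self.uberblock = new_root

    pool = ToyPool()
    pool.commit([b"old data"])
    old_root = pool.uberblock
    pool.commit([b"new data"])        # the old blocks are still present
    assert old_root in pool.blocks and pool.blocks[old_root][0] == "root"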
Steven Sim wrote:
> I understand RAID-5 quite well, and from both of your RAID-Z
> descriptions I see that the RAID-Z parity is also a separate block on a
> separate disk. Very well. This is just like RAID-5.

Yup. But there's a little bit of magic, which I'll try to explain below.
With more ascii art! ;-)

> My confusion is simple. Would this not then also give rise to the
> write-hole vulnerability of RAID-5?

Copy-on-Write.

> Jeff Bonwick states "that there's no way to update two or more disks
> atomically, so RAID stripes can become damaged during a crash or power
> outage."

Traditional RAID writes in place. Each disk requires a separate write
action. Once the first partial block (chunk) is written, the original
data and parity are effectively lost.

> If I understand correctly, the data and parity blocks for RAID-Z are
> also written in two separate atomic operations, as per RAID-5 (the only
> difference being that each stripe can be of a different size).

As with RAID-5 on a four-disk stripe, there are four independent writes,
and they don't need to be atomic, as Copy-on-Write implies that the new
blocks are written elsewhere on disk, while maintaining the original
data. Only after all four writes return and are flushed to disk can you
proceed and update the metadata.

> How then does it fit into Jeff's statement that "Every block is its own
> RAID-Z stripe"? (Perhaps I misunderstood, but I now think this
> statement refers to the fact that RAID-Z has a variable stripe size,
> rather than meaning that each block holds its own RAID-Z parity within
> itself.)

I think what is meant here is that the stripe is generated from
filesystem data. Parity can be computed from file system data long
before you go down to the device level.

Of course, if you're writing into the middle of a file, you will need
that portion in the file system cache, but when an application is busy
on a file, we may assume we have (the relevant portions of) that file in
the cache, so no additional disk reads are needed.

> If the write-hole power outage situation as described by Jeff Bonwick
> occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?

Because it writes somewhere else, due to Copy-on-Write. The original
data is still on disk, untouched.

> Through each block's independent checksum held one level above in the
> metadata block? Is this correct? Or am I completely off course?

Correct.

> Henk Langeveld's wonderful character-based diagrams describe what is
> basically a standard RAID-5 layout on 4 disks. How is RAID-Z any
> different from RAID-5? (Except for the ability to use stripes of
> different sizes, which allows RAID-Z to never have to do a
> read-modify-write. This increases performance very significantly, but I
> am unable to relate it to the write-hole vulnerability issue.)

On-disk layout before the write:

  disk 1     disk 2     disk 3     disk 4     checksum
   _ _ _ _    _ _ _ _    _ _ _ _    _ _ _ _
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|  |X|X|X|X|  cccccccc1
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|

Filesystem cache:

  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|              cccccccc1

Application writes:

   y.y.y.    .y.y.y.y   .y.

Filesystem cache updated:

  |x|y|y|y|  |y|y|y|y|  |y|x|x|x|              cccccccc2

Logical write results (computes parity):

  |x|y|y|y|  |y|y|y|y|  |y|x|x|x|  |Y|Y|Y|Y|   cccccccc2

with four independent physical writes:

  disk 1     disk 2     disk 3     disk 4     checksum
   _ _ _ _    _ _ _ _    _ _ _ _    _ _ _ _
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|x|x|x|  |x|x|x|x|  |x|x|x|x|  |X|X|X|X|  cccccccc1
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |x|y|y|y|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |Y|Y|Y|Y|
  |_|_|_|_|  |y|y|y|y|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |y|x|x|x|  |_|_|_|_|
  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|  |_|_|_|_|

Note that after these four writes, the original data is still on disk,
and allocated. If no snapshots are taken, those blocks can be
reallocated after the metadata and the uberblock are written.

If the system crashes before the uberblock goes down, the newly written
data is effectively lost, as if it was never written, but neither do we
lose the original data (and parity). If a disk dies during a crash, we
still have the original parity.

Cheers, Henk
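To see the same contrast in miniature, here is a toy simulation (my own
illustration; the "disks" are just Python byte strings) of a crash that
lands after the data write but before the parity write, first for an
in-place RAID-5 update and then for the copy-on-write layout Henk draws
above.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Two toy data columns and one parity column.
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)

    # --- Traditional RAID-5: update d0 in place, crash before parity write ---
    d0 = b"CCCC"                     # new data overwrites old data
    # <crash here>                   # parity still reflects the OLD d0
    assert xor(d1, parity) != d0     # rebuilding d0 from d1+parity now gives
                                     # the old value; the stripe is inconsistent

    # --- RAID-Z / COW: write the new stripe somewhere else ---
    old_stripe = (b"AAAA", b"BBBB")
    old_parity = xor(*old_stripe)
    new_stripe = (b"CCCC", b"BBBB")
    new_parity = xor(*new_stripe)    # written to fresh blocks
    # <crash here>                   # the uberblock still points at old_stripe,
    assert xor(old_stripe[1], old_parity) == old_stripe[0]  # which is consistent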
Hello Henk,

Friday, May 18, 2007, 12:09:40 AM, you wrote:

>> If I understand correctly, the data and parity blocks for RAID-Z are
>> also written in two separate atomic operations, as per RAID-5 (the
>> only difference being that each stripe can be of a different size).

HL> As with RAID-5 on a four-disk stripe, there are four independent
HL> writes, and they don't need to be atomic, as Copy-on-Write implies
HL> that the new blocks are written elsewhere on disk, while maintaining
HL> the original data. Only after all four writes return and are flushed
HL> to disk can you proceed and update the metadata.

And to make things clear: metadata is also updated in the spirit of COW,
so the metadata is written to new locations and then the uber block is
atomically updated to point to the new metadata.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
>>> If I understand correctly, the data and parity blocks for RAID-Z are
>>> also written in two separate atomic operations, as per RAID-5 (the
>>> only difference being that each stripe can be of a different size).
>
> HL> As with RAID-5 on a four-disk stripe, there are four independent
> HL> writes, and they don't need to be atomic, as Copy-on-Write implies
> HL> that the new blocks are written elsewhere on disk, while
> HL> maintaining the original data. Only after all four writes return
> HL> and are flushed to disk can you proceed and update the metadata.
>
> And to make things clear: metadata is also updated in the spirit of
> COW, so the metadata is written to new locations and then the uber
> block is atomically updated to point to the new metadata.

Well, to add to this, uber-blocks are also updated in COW fashion: there
is a circular array of 128 uber-blocks, and the new uber-block is
written to the slot next to the current one.

victor
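A sketch of how such a circular array can behave on recovery (my own
simplification; the class and field names are invented, not ZFS
structures): each commit writes the new uberblock into another of the
128 slots, and on open the pool takes the valid slot with the highest
transaction-group number, so a torn write of the newest slot simply
means falling back to the previous one.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Uberblock:
        txg: int          # transaction group number
        root: object      # pointer to the root of the block tree
        valid: bool       # stands in for "checksum verifies"

    class UberblockRing:
        SLOTS = 128

        def __init__(self):
            self.slots: list[Optional[Uberblock]] = [None] * self.SLOTS

        def commit(self, ub: Uberblock):
            # The new uberblock goes into a different slot than the current
            # one, so the previous uberblock is never overwritten by a commit.
            self.slots[ub.txg % self.SLOTS] = ub

        def active(self) -> Optional[Uberblock]:
            # On import, use the valid uberblock with the highest txg.
            candidates = [u for u in self.slots if u is not None and u.valid]
            return max(candidates, key=lambda u: u.txg, default=None)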
HL>> And to make things clear: metadata is also updated in the spirit of
HL>> COW, so the metadata is written to new locations and then the uber
HL>> block is atomically updated to point to the new metadata.

Victor Latushkin wrote:
> Well, to add to this, uber-blocks are also updated in COW fashion:
> there is a circular array of 128 uber-blocks, and the new uber-block
> is written to the slot next to the current one.

Correct, I left it out because there's more detail involved with the
uberblock. We can deal with it when we get there.

Cheers, Henk
Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
> Gurus;
> I am exceedingly impressed by ZFS, although in my humble opinion Sun
> is not doing enough evangelizing for it.

What else do you think we should be doing?


David
David Bustos wrote:
> Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
>> Gurus;
>> I am exceedingly impressed by ZFS, although in my humble opinion Sun
>> is not doing enough evangelizing for it.
>
> What else do you think we should be doing?

Send Thumpers to every respectable journal for a review!

That's probably a problem for marketing: how to target the publications
that the people with the checkbooks read, to broaden the awareness of
ZFS.

Just about every x86 server manufacturer provides and promotes the
features of hardware RAID solutions. Maybe Sun should make more of the
cost savings in storage that ZFS offers to gain a cost advantage over
the competition, or even "save $ on HP servers by running Solaris and
removing the RAID".

How about some JBOD-only storage products? Or at least make hardware
RAID an add-on option, to cater for a broader market.

Trying to break (especially Windows) administrators and CIOs out of the
"hardware RAID is best" or even "hardware RAID is essential" mindset is
a tough ask. As hardware RAID drops in price and moves into
consumer-grade products, ZFS will lose the cost advantage (just try to
get a JBOD-only SATA card; I only know of one).

Ian
On 18-May-07, at 4:39 PM, Ian Collins wrote:
> David Bustos wrote:
> ... maybe Sun should make more of the cost savings in storage that ZFS
> offers to gain a cost advantage over the competition,

Cheaper AND more robust+featureful is hard to beat.

--T
> Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
>> Gurus;
>> I am exceedingly impressed by ZFS, although in my humble opinion Sun
>> is not doing enough evangelizing for it.
>
> What else do you think we should be doing?
>
> David

I'll jump in here. I am a huge fan of ZFS. At the same time, I know
about some of its warts.

ZFS hints at adding agility to data management and is a wonderful
system. At the same time, it operates on some assumptions which are
antithetical to data agility, including:

* inability to online restripe: add/remove data/parity disks
* inability to make effective use of varying-sized disks

In one breath ZFS says, "Look how well you can dynamically alter
filesystem storage." In another breath ZFS says, "Make sure that your
pools have identical spindles and that you have accurately predicted
future bandwidth, access time, vdev size, and parity disks. Because you
can't change any of that later."

I know, down the road you can tack new vdevs onto the pool, but that
really misses the point. Even so, if I accidentally add a vdev to a pool
and then realize my mistake, I am sunk. Once a vdev is added to a pool,
it is attached to the pool forever.

Ideally I could provision a vdev, later decide that I need a disk/LUN
from that vdev, and simply remove the disk/LUN, decreasing the vdev
capacity. I should have the ability to decide that current redundancy
needs are insufficient and allocate *any* number of new parity disks. I
should be able to have a pool from a rack of 15x250GB disks and then
later add a rack of 11x750GB disks *to the vdev*, not by making another
vdev. I should have the luxury of deciding to put hot Oracle indexes on
their own vdev, deallocate spindles from an existing vdev, and put those
indexes on the new vdev. I should be able to change my mind later and
put it all back.

Most important is the access time issue. Since there are no
partial-stripe reads in ZFS, the access time for a RAIDZ vdev is the
same as single-disk access time, no matter how wide the stripe is.

How to evangelize better? Get rid of the glaring "you can't change it
later" problems.

Another thought is that flash storage has all of the indicators of being
a disruptive technology described in "The Innovator's Dilemma". What
this means is that flash storage *will* take over hard disks. It is
inevitable. ZFS has a weakness with access times but handles
single-block corruption very nicely. ZFS also has the ability to do very
wide RAIDZ stripes, up to 256(?) devices, providing mind-numbing
throughput. Flash has near-zero access times and relatively low
throughput. Flash is also prone to single-block failures once the erase
limit has been reached for a block. ZFS + Flash = near-zero access time,
very high throughput, and high data integrity.

To answer the question: get rid of the limitations and build a
Thumper-like device using flash. Market it for Oracle redo logs, temp
space, swap space (flash is now cheaper than RAM), anything that needs
massive throughput and ridiculous iops numbers, but not necessarily huge
storage. Each month, the cost of flash will fall 4% anyway, so get ahead
of the curve now.

My 2 cents, at least.

Marty

This message posted from opensolaris.org
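Marty's access-time point is easiest to see with some rule-of-thumb
arithmetic. The model and the numbers below are assumptions for
illustration only (real pools have caching, multiple vdevs, and mixed
workloads), but they show why a wide raidz vdev delivers roughly
single-disk IOPS for small random reads while mirrors scale with spindle
count.

    DISK_IOPS = 150  # assumed small-random-read IOPS of one spindle

    def random_read_iops(ndisks, layout):
        """Rule-of-thumb model, not a benchmark."""
        if layout == "mirror":
            # Each disk can serve independent reads, so IOPS scale with spindles.
            return ndisks * DISK_IOPS
        if layout == "raidz":
            # Every block is one full stripe, so a read touches all data columns
            # and the vdev behaves like roughly one disk for random reads.
            return DISK_IOPS
        raise ValueError(layout)

    print(random_read_iops(8, "mirror"))  # ~1200
    print(random_read_iops(8, "raidz"))   # ~150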