Hi.

I read most of the ZFS documentation and Jeff's blog entry about RAID-Z
(http://blogs.sun.com/roller/page/bonwick?entry=raid_z) and I fail to
understand the difference between RAID-Z and RAID-3. In RAID-3 every read or
write is a full-stripe operation, so there is no need for read-modify-write.
Because I found many differing RAID-3 descriptions, I'll try to explain how
it works. The most important thing: a RAID-3 device has a bigger sector size.
For a 5-disk RAID-3 array, where the disk sector size is 512 bytes, we get a
device with a 2 kB sector (4*512 bytes of data + 512 bytes of parity). Every
read/write operation (a multiple of 2 kB) needs all components.

I implemented RAID-3 for FreeBSD and it works very well, but some things
should be noted:

- linear/random (parallel) writing is faster than for RAID-5,
- linear (parallel) reads are faster than for RAID-5,
- random, parallel reads are slower than for RAID-5.

It is possible to speed up random reads by also using the parity component.
In case of power loss or a system crash, parity still has to be
resynchronized: there is no guarantee that one of the disks wasn't slower
and didn't finish its write.

After explaining this, my questions are:

Where is the difference between RAID-Z and RAID-3?

How does RAID-Z compare to RAID-5 on random parallel read performance? Have
you tried to put ZFS on a HW RAID-5 array and compare the performance with
ZFS's RAID-Z for many parallel random reads?

Does RAID-Z depend on ZFS checksums? Is it still reliable if checksumming is
turned off? If parity correctness is not verified at the first opportunity
(after power loss, on boot), the risk of data loss is bigger: one of the
raid's components can fail, leaving the data in an uncorrectable state if
that data was never read, so there was no chance to detect the problem.

And BTW, I fully understand ZFS's goal of being easy to use, but do you plan
to provide more flexibility in storage pool management? For example, if I
have 5 disks (3x40GB and 2x60GB), I'd like to create two stripes (RAID-0),
3x40GB and 2x60GB, so I can then mirror both stripes and get 120GB of
RAID-1-protected storage. It doesn't seem to be possible currently.

Thanks in advance.

PS. ...and let me repeat what you have already heard so many times: great work!

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
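To make the RAID-3 geometry described above concrete, here is a minimal
illustrative sketch in C (not the actual FreeBSD graid3 code; the layout and
helper names are assumed for the example): one 2 kB logical sector is split
into four 512-byte data sectors plus one XOR parity sector, so every read or
write touches all five components, and any single lost sector can be rebuilt
from the other four.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR      512
#define DATA_DISKS  4

/* Split one 2 kB logical sector into per-disk sectors and compute parity. */
static void
raid3_encode(const uint8_t *logical, uint8_t data[DATA_DISKS][SECTOR],
    uint8_t parity[SECTOR])
{
	memset(parity, 0, SECTOR);
	for (int d = 0; d < DATA_DISKS; d++) {
		memcpy(data[d], logical + d * SECTOR, SECTOR);
		for (int i = 0; i < SECTOR; i++)
			parity[i] ^= data[d][i];
	}
}

/* Reconstruct one failed disk's sector from the survivors plus parity. */
static void
raid3_rebuild(uint8_t data[DATA_DISKS][SECTOR], const uint8_t parity[SECTOR],
    int failed)
{
	memcpy(data[failed], parity, SECTOR);
	for (int d = 0; d < DATA_DISKS; d++)
		if (d != failed)
			for (int i = 0; i < SECTOR; i++)
				data[failed][i] ^= data[d][i];
}

int
main(void)
{
	uint8_t logical[SECTOR * DATA_DISKS];   /* one 2 kB "big sector" */
	uint8_t data[DATA_DISKS][SECTOR], parity[SECTOR];

	for (size_t i = 0; i < sizeof (logical); i++)
		logical[i] = (uint8_t)(i * 7 + 3);

	raid3_encode(logical, data, parity);    /* every write hits all 5 disks */

	memset(data[2], 0xff, SECTOR);          /* simulate losing disk 2 */
	raid3_rebuild(data, parity, 2);
	assert(memcmp(data[2], logical + 2 * SECTOR, SECTOR) == 0);
	return (0);
}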
> I read most of the ZFS documentation and Jeff's blog entry about RAID-Z
> and I fail to understand the difference between RAID-Z and RAID-3.

Ah. Let me try to clarify.

First, some background for folks who haven't written RAID software in the
last few weeks (and I'm told such people really do exist):

RAID-3 can be thought of as RAID-4 with two constraints and a twist:

(1) the number of disks must be exactly 2^n + 1
(2) the minimum transfer size is 2^n sectors
(3) data placement is row-major rather than column-major (see below)

The power-of-two blocksize constraint isn't imposed by RAID-3 proper, but
rather is necessary to make it useful to most filesystems.

By row-major vs. column-major I mean this: think of the disks as columns of
a matrix, such that (M, N) is the Mth sector of disk N. Disk I/O is always
column-major, by definition. The two most common options for data placement
are row-major and column-major.

For a 20-sector write at offset 40, this is row-major placement:

    40 41 42 43  P
    44 45 46 47  P
    48 49 50 51  P
    52 53 54 55  P
    56 57 58 59  P

And this is column-major placement:

    40 45 50 55  P
    41 46 51 56  P
    42 47 52 57  P
    43 48 53 58  P
    44 49 54 59  P

With row-major placement, you must either do a lot of scatter/gather setup
to get each sector to land where you want it, or you must interleave writes
and de-interleave reads by hand. With column-major placement, I/O is trivial
because the data is stored on disk in logical order.

RAID-2 and RAID-3 both do row-major data placement.

All other RAID schemes I'm aware of do column-major data placement.

One might ask: why on earth would anyone do row-major placement? Indeed, it
seems a bit crazy for disks because it's just extra work. But in some cases
the interleave may be a natural side-effect of the way the hardware works --
memory, for example. Parity memory is essentially a form of RAID-2, albeit
with different error semantics.

Now, to address the original question:

RAID-Z is like RAID-3 in that all writes are full-stripe writes. This
eliminates a number of thorny performance and correctness problems such as
the read-modify-write penalty and the RAID-5 write hole.

RAID-Z is like RAID-4 in that it supports any number of disks, it's
sector-addressable, and it does column-major data placement.

RAID-Z is like RAID-5 in that it doesn't have a dedicated parity disk; the
parity is distributed across all disks to maximize bandwidth.

And RAID-Z is unique in that it can detect and correct silent data
corruption, as described at
http://blogs.sun.com/roller/page/bonwick?entry=raid_z.

> PS. ...and let me repeat what you have already heard so many times: great work!

Thank you! Much appreciated.
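To make the row-major vs. column-major distinction concrete, here is a small
stand-alone C sketch (illustrative only, not code from ZFS or any RAID
driver) that reproduces the two matrices above for a 20-sector write at
offset 40 across 4 data columns; the parity column is left out.

#include <stdio.h>

#define NDATA 4

static void
place(int off, int len, int row_major)
{
	int rows = (len + NDATA - 1) / NDATA;   /* rows touched by this write */

	for (int i = 0; i < len; i++) {
		int row, col;
		if (row_major) {        /* consecutive sectors go across a row */
			row = i / NDATA;
			col = i % NDATA;
		} else {                /* consecutive sectors go down a column */
			row = i % rows;
			col = i / rows;
		}
		printf("logical sector %2d -> row %d, disk %d\n", off + i, row, col);
	}
}

int
main(void)
{
	printf("row-major placement (RAID-2/RAID-3 style):\n");
	place(40, 20, 1);
	printf("\ncolumn-major placement (everything else):\n");
	place(40, 20, 0);
	return (0);
}

With column-major placement the 20 logical sectors map straight down each
disk in order, which is why no scatter/gather or interleaving work is needed.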
> Does RAID-Z depend on ZFS checksums?
> Is it still reliable if checksumming is turned off?

The usual RAID capabilities (handling explicit I/O errors and/or whole-disk
failure) do not rely on checksums.

The self-healing data feature (detecting and correcting silent data
corruption) does, of course, require checksums.
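A conceptual sketch of that distinction, in C. The helper functions
(disk_read, reconstruct_from_parity, and so on) are hypothetical, not the
ZFS API; the point is only that an explicit I/O error is visible to any RAID,
while a read that "succeeds" with wrong contents can only be caught when a
separately stored checksum is available.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primitives assumed for the sketch. */
bool     disk_read(int disk, uint64_t lba, void *buf, size_t len); /* false on I/O error */
void     disk_write(int disk, uint64_t lba, const void *buf, size_t len);
void     reconstruct_from_parity(int bad_disk, uint64_t lba, void *buf, size_t len);
uint64_t checksum(const void *buf, size_t len);

bool
self_healing_read(int disk, uint64_t lba, void *buf, size_t len,
    uint64_t expected_cksum)
{
	/* Explicit I/O error: any RAID can recover this, no checksum needed. */
	if (!disk_read(disk, lba, buf, len)) {
		reconstruct_from_parity(disk, lba, buf, len);
		disk_write(disk, lba, buf, len);        /* repair the bad copy */
		return (true);
	}

	/*
	 * Silent corruption: the read succeeded but the data is wrong.
	 * Only a checksum stored elsewhere can detect this case.
	 */
	if (checksum(buf, len) != expected_cksum) {
		reconstruct_from_parity(disk, lba, buf, len);
		if (checksum(buf, len) != expected_cksum)
			return (false);                 /* unrecoverable */
		disk_write(disk, lba, buf, len);        /* heal the bad copy */
	}
	return (true);
}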
On Mon, Nov 21, 2005 at 12:41:50AM -0800, Jeff Bonwick wrote:
> > Does RAID-Z depend on ZFS checksums?
> > Is it still reliable if checksumming is turned off?
>
> The usual RAID capabilities (handling explicit I/O errors and/or
> whole-disk failure) do not rely on checksums.
>
> The self-healing data feature (detecting and correcting silent data
> corruption) does, of course, require checksums.

I was pondering this particular situation:

ZFS receives a write request and sends it to three RAID-Z components. The
request is delivered to two of them and you lose power before it is
delivered to the last component (you know, ZFS likes cheap disks and one of
them was slower :)).

How is this issue detected after the reboot when one doesn't use checksums?
Maybe this is one of the reasons checksums are there?

This issue is ignored by most software RAID implementations, but it is very
real (rare, but real).

Thanks.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Pawel Jakub Dawidek wrote:

> ZFS receives a write request and sends it to three RAID-Z components. The
> request is delivered to two of them and you lose power before it is
> delivered to the last component (you know, ZFS likes cheap disks and one
> of them was slower :)).
>
> How is this issue detected after the reboot when one doesn't use checksums?
> Maybe this is one of the reasons checksums are there?

ZFS treats disk updates like a transaction, so an interrupted write
operation will not affect any of the reachable on-disk data. When the system
reboots, the blocks that were being written at the time of the power failure
will not be reachable from the filesystem metadata, so the on-disk state of
the filesystem will be fine.

And because ZFS knows more than an LVM about what space is free and what is
in use, it can always do full-stripe writes using a variable stripe size, so
it doesn't matter what's already on the disk. This also means that you don't
have to wait to zero out the components of a RAID-Z vdev when you add it to
a pool.

> This issue is ignored by most software RAID implementations, but it is
> very real (rare, but real).

Hardware RAID isn't enough by itself either. I was asked to look at a box
running RHEL which was acting a bit "funny". I found a whole bunch of kernel
messages going back a few months about inconsistent ext2 cylinder group free
block bitmaps. Yay, a whole bunch of files were sharing storage with each
other, and it was impossible to tell how many others had been corrupted
before that point. I had to verify 500GB of gzip'd bioinformatics data and
re-mirror many tens of gigabytes of compressed data that was unverifiable
(there's no checksum in .Z files). I really wanted ZFS that day...

-Jason
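A rough sketch of that transactional, copy-on-write idea (made-up structures
and helpers, not ZFS code): new blocks land only in previously free space,
and the very last step is rewriting the root pointer, so a crash at any
earlier point leaves the old, consistent state intact and nothing needs to
be resynchronized.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct root_block {
	uint64_t txg;        /* transaction group number */
	uint64_t tree_blkno; /* points at a fully written block tree */
	uint64_t cksum;      /* checksum over this root block */
};

/* Hypothetical helpers assumed for the sketch. */
uint64_t alloc_free_block(void);
void     write_block(uint64_t blkno, const void *data, size_t len);
void     write_root_copies(const struct root_block *rb);  /* multiple copies */
uint64_t checksum_of(const void *data, size_t len);

void
commit_txg(struct root_block *rb, const void *new_tree, size_t len)
{
	/* 1. Write the new data and metadata into previously free space. */
	uint64_t blkno = alloc_free_block();
	write_block(blkno, new_tree, len);

	/*
	 * 2. Only then point the root at it, in one self-checksummed update.
	 *    A crash anywhere before this leaves the old state untouched.
	 */
	rb->txg++;
	rb->tree_blkno = blkno;
	rb->cksum = checksum_of(rb, sizeof (*rb) - sizeof (rb->cksum));
	write_root_copies(rb);
}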
Pawel Jakub Dawidek wrote on 12/06/05 09:27:

> On Mon, Nov 21, 2005 at 12:41:50AM -0800, Jeff Bonwick wrote:
> > > Does RAID-Z depend on ZFS checksums?
> > > Is it still reliable if checksumming is turned off?
> >
> > The usual RAID capabilities (handling explicit I/O errors and/or
> > whole-disk failure) do not rely on checksums.
> >
> > The self-healing data feature (detecting and correcting silent data
> > corruption) does, of course, require checksums.
>
> I was pondering this particular situation:
>
> ZFS receives a write request and sends it to three RAID-Z components. The
> request is delivered to two of them and you lose power before it is
> delivered to the last component (you know, ZFS likes cheap disks and one
> of them was slower :)).
>
> How is this issue detected after the reboot when one doesn't use checksums?
> Maybe this is one of the reasons checksums are there?

This matters for filesystems like UFS: whenever the system crashes, some
volume managers or RAID arrays resync everything every time, while most of
them log the "dirty blocks region" to NVRAM (or even synchronously to a
region of disk) and thus resync only those dirty blocks.

In the case of ZFS it may not matter, since ZFS always writes to new
locations and a transaction is not committed until the root block is
written, and there are multiple copies of it.

Srinivas.

> This issue is ignored by most software RAID implementations, but it is
> very real (rare, but real).
>
> Thanks.
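For contrast, here is a sketch of the dirty-region-log approach mentioned
above, as a traditional volume manager might implement it (the interfaces
are hypothetical and heavily simplified): a region is marked dirty in NVRAM
before its data is written and cleared afterwards, so after a crash only the
still-dirty regions need their mirror or parity resynchronized. ZFS can skip
this machinery entirely because its on-disk state is always consistent.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_SHIFT 20          /* 1 MB regions */

/* Hypothetical primitives. */
void nvram_set_bit(uint64_t bit);    /* persisted before returning */
void nvram_clear_bit(uint64_t bit);
bool nvram_test_bit(uint64_t bit);
void raid_write(uint64_t offset, const void *buf, size_t len);
void raid_resync_region(uint64_t region);

void
logged_write(uint64_t offset, const void *buf, size_t len)
{
	uint64_t first = offset >> REGION_SHIFT;
	uint64_t last = (offset + len - 1) >> REGION_SHIFT;

	for (uint64_t r = first; r <= last; r++)
		nvram_set_bit(r);        /* must be durable before the write */
	raid_write(offset, buf, len);
	for (uint64_t r = first; r <= last; r++)
		nvram_clear_bit(r);
}

void
recover_after_crash(uint64_t nregions)
{
	/* Only regions that were mid-write need parity/mirror resync. */
	for (uint64_t r = 0; r < nregions; r++)
		if (nvram_test_bit(r))
			raid_resync_region(r);
}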