Hi.

I've a prototype RAID5 implementation for ZFS. It only works in
non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
performance, as I suspected that RAIDZ, because of full-stripe
operations, doesn't work well for random reads issued by many processes
in parallel.

There is of course the write-hole problem, which can be mitigated by
running a scrub after a power failure or system crash. Another idea is
to store dirty regions somewhere and regenerate parity only for those
regions after reboot, but I'm not yet sure how to make it efficient -
adding two write-cache-flush requests to each write request isn't a
good idea. Anyway, people live with the RAID5 write hole today, so why
not give them the option?

My test environment was 5 SATA disks and ZFS on FreeBSD. For testing I
used raidtest, a tool I wrote some time ago. It starts a given number
of processes that issue a given number of random I/O requests. I was
using 8 processes; I/O size was a random value between 2kB and 32kB
(with a 2kB step), and the offset was a random value between 0 and 10GB
(also with a 2kB step). I was testing a ZVOL created on top of my pool.
I first ran raidtest in write-only mode, so that ZFS allocates blocks,
then exported the pool and imported it again (to flush the cache), and
started a read-only test with the same I/O requests.

Pool configuration for RAIDZ:

	  pool: tank
	 state: ONLINE
	 scrub: none requested
	config:

		NAME        STATE     READ WRITE CKSUM
		tank        ONLINE       0     0     0
		  raidz1    ONLINE       0     0     0
		    ad1     ONLINE       0     0     0
		    ad4     ONLINE       0     0     0
		    ad5     ONLINE       0     0     0
		    ad6     ONLINE       0     0     0
		    ad7     ONLINE       0     0     0

	errors: No known data errors

Pool configuration for RAID5:

	  pool: tank
	 state: ONLINE
	 scrub: none requested
	config:

		NAME        STATE     READ WRITE CKSUM
		tank        ONLINE       0     0     0
		  raid5     ONLINE       0     0     0
		    ad1     ONLINE       0     0     0
		    ad4     ONLINE       0     0     0
		    ad5     ONLINE       0     0     0
		    ad6     ONLINE       0     0     0
		    ad7     ONLINE       0     0     0

And here are the results:

RAIDZ:

	Number of READ requests:      40000.
	Number of WRITE requests:     0.
	Number of bytes to transmit:  695678976.
	Number of processes:          8.
	Bytes per second:             1305213
	Requests per second:          75

RAID5:

	Number of READ requests:      40000.
	Number of WRITE requests:     0.
	Number of bytes to transmit:  695678976.
	Number of processes:          8.
	Bytes per second:             2749719
	Requests per second:          158

In other words, in this particular test RAID5 is 2.1 times faster than
RAIDZ. I would expect even better results for bigger pools (with more
disks in one RAIDZ vdev).

My question is: Is there any interest in finishing RAID5/RAID6 for ZFS?
If there is no chance it will be integrated into ZFS at some point, I
won't bother finishing it.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
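For the curious, here is a minimal sketch of the workload described
above - this is NOT the actual raidtest source, just an illustration
of the stated parameters (8 processes, 40000 requests total, sizes
2kB..32kB on 2kB boundaries, offsets within 10GB); the constants and
the read-only mode are simplifications:

	/*
	 * Sketch of the benchmark workload, not raidtest itself.
	 * Each of 8 forked processes issues 5000 random pread(2)
	 * calls: size 2kB..32kB in 2kB steps, offset 0..10GB
	 * aligned to 2kB.
	 */
	#include <sys/wait.h>

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define	NPROC	8
	#define	NREQS	5000
	#define	STEP	2048ULL			/* 2kB step */
	#define	MAXSIZE	(32ULL * 1024)		/* largest request */
	#define	SPAN	(10ULL << 30)		/* 10GB offset range */

	int
	main(int argc, char **argv)
	{
		if (argc != 2) {
			fprintf(stderr, "usage: %s <device>\n", argv[0]);
			return (1);
		}
		for (int i = 0; i < NPROC; i++) {
			if (fork() == 0) {
				int fd = open(argv[1], O_RDONLY);
				char *buf = malloc(MAXSIZE);

				if (fd == -1 || buf == NULL)
					_exit(1);
				srandom(getpid());
				for (int j = 0; j < NREQS; j++) {
					/* 2kB..32kB in 2kB steps */
					size_t size = STEP *
					    (1 + random() % (MAXSIZE / STEP));
					/* 0..10GB in 2kB steps */
					off_t off = STEP *
					    (random() % (SPAN / STEP));
					if (pread(fd, buf, size, off) == -1)
						_exit(1);
				}
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return (0);
	}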
Hello Pawel,

Excellent job!

Now I guess it would be a good idea to get writes done properly, even
if it means making them slow (like with SVM). The end result would be:
if you want fast writes/slow reads, go ahead with raid-z; if you need
fast reads/slow writes, go with raid-5.

btw: I'm just thinking out loud - for raid-5 writes, couldn't you
somehow utilize the ZIL to make writes safe? I'm asking because we've
got the ability to put the zil somewhere else, like an NVRAM card...

--
Best regards,
 Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                      http://milek.blogspot.com
> Now I guess it would be a good idea to get writes done properly, even
> if it means making them slow (like with SVM). The end result would
> be: if you want fast writes/slow reads, go ahead with raid-z; if you
> need fast reads/slow writes, go with raid-5.
>
> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the zil somewhere else, like an NVRAM card...

But the safety of raidz (and the overall on-disk consistency of the
pool) does not currently depend on the ZIL. It instead depends on the
fact that blocks are never modified in place, but written first, then
activated atomically.

So I guess this depends on how the R5 is implemented in ZFS. As long as
all writes cause a new block to be written (which has a full R5
stripe?), then the activation will be atomic and there is no write
hole. The only problem comes if existing blocks were modified (and that
would cause problems with snapshots anyway, right?)

--
Darren Dunham                                      ddunham at taos.com
Senior Technical Consultant       TAOS         http://www.taos.com/
Got some Dr Pepper?                      San Francisco, CA bay area
        < This line left intentionally blank to confuse you. >
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
> Hello Pawel,
>
> Excellent job!
>
> Now I guess it would be a good idea to get writes done properly, even
> if it means making them slow (like with SVM). The end result would
> be: if you want fast writes/slow reads, go ahead with raid-z; if you
> need fast reads/slow writes, go with raid-5.

Writes in non-degraded mode already work; only degraded mode doesn't
work yet. My implementation is based on RAIDZ, so I'm planning to
support RAID6 as well.

> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the zil somewhere else, like an NVRAM card...

The problem with RAID5 is that different blocks share the same parity,
which is not the case for RAIDZ. When you write a block in RAIDZ, you
write the data and the parity, and then you switch the pointer in the
uberblock. For RAID5, you write the data and you need to update the
parity, which also protects some other data. Now if you write the data
but don't update the parity before a crash, you have a hole. If you
update the parity before the write and then crash, the parity is
inconsistent with the other blocks in the same stripe.

My idea was to have one sector every 1GB on each disk for a "journal"
to keep a list of blocks being updated. For example, you want to write
2kB of data at offset 1MB. You first store offset+size in this journal,
then write the data and update the parity, and then remove offset+size
from the journal. Unfortunately, we would need to flush the write cache
twice: after the offset+size addition and before the offset+size
removal. We could optimize it by doing lazy removal, e.g. wait for ZFS
to flush the write cache as part of a transaction and then remove the
old offset+size pairs. But I still expect this to give too much
overhead.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
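To make the journal idea a bit more concrete, here is a sketch of what
such a per-1GB journal sector might look like - the names, the exact
layout, and the entry count are hypothetical, not taken from the
prototype:

	/*
	 * Hypothetical on-disk layout for the dirty-region journal
	 * sketched above; illustration only, not prototype code.
	 */
	#include <stdint.h>

	#define	JOURNAL_SPACING	 (1ULL << 30)	/* one sector per 1GB */
	#define	JOURNAL_NENTRIES 31		/* (512 - 16) / 16 */

	struct journal_entry {
		uint64_t	je_offset;	/* start of in-flight write */
		uint64_t	je_size;	/* length of in-flight write */
	};

	struct journal_sector {
		uint64_t	js_magic;	/* identifies a valid journal */
		uint64_t	js_gen;		/* bumped on every update */
		struct journal_entry js_entries[JOURNAL_NENTRIES];
	};

	/*
	 * Write path, per data write:
	 *
	 *   1. append {offset, size} to the region's journal sector
	 *   2. flush the write cache                    <- 1st flush
	 *   3. write the data and the updated parity
	 *   4. clear the entry and flush again          <- 2nd flush
	 *
	 * The lazy-removal optimization drops the flush in step 4 by
	 * clearing stale entries the next time ZFS flushes the cache
	 * for a transaction group anyway.  After a crash, only regions
	 * with non-empty journal sectors need their parity rebuilt.
	 */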
Hello Pawel,

Monday, September 10, 2007, 6:18:37 PM, you wrote:

PJD> Writes in non-degraded mode already work; only degraded mode
PJD> doesn't work yet. My implementation is based on RAIDZ, so I'm
PJD> planning to support RAID6 as well.

PJD> The problem with RAID5 is that different blocks share the same
PJD> parity, which is not the case for RAIDZ. When you write a block
PJD> in RAIDZ, you write the data and the parity, and then you switch
PJD> the pointer in the uberblock. For RAID5, you write the data and
PJD> you need to update the parity, which also protects some other
PJD> data. Now if you write the data but don't update the parity
PJD> before a crash, you have a hole. If you update the parity before
PJD> the write and then crash, the parity is inconsistent with the
PJD> other blocks in the same stripe.

Are you overwriting old data? I hope you're not...

I don't think you should suffer from the above problem in ZFS, due to
COW. If you are not overwriting and you're just writing to new
locations from the pool's perspective, those changes (both the new
data block and the parity block) won't be active until they are both
flushed and the uberblock is updated... right?

--
Best regards,
 Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                      http://milek.blogspot.com
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote:
> Are you overwriting old data? I hope you're not...

I am - I overwrite the parity; this is the whole point. That's why the
ZFS designers used RAIDZ instead of RAID5, I think.

> I don't think you should suffer from the above problem in ZFS, due to
> COW.

I do, because independent blocks share the same parity block.

> If you are not overwriting and you're just writing to new locations
> from the pool's perspective, those changes (both the new data block
> and the parity block) won't be active until they are both flushed and
> the uberblock is updated... right?

Assume a 128kB stripe size in RAID5. You have three disks: A, B and C.
ZFS writes 128kB at offset 0. This makes RAID5 write the data to disk A
and the parity to disk C (both at offset 0). Then ZFS writes 128kB at
offset 128kB. RAID5 writes the data to disk B (at offset 0) and updates
the parity on disk C (also at offset 0).

As you can see, two independent ZFS blocks share one parity block. COW
won't help you here; you would need to be sure that each ZFS
transaction goes to a different (and free) RAID5 row.

This is, I believe, the main reason why plain RAID5 wasn't used in the
first place.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
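A tiny sketch of the arithmetic in that example - assuming three disks,
a 128kB stripe unit, and non-rotated parity on disk C, purely for
illustration:

	/*
	 * With 3 disks, a 128kB stripe unit and parity pinned to
	 * disk C, offsets 0 and 128kB map to different data disks
	 * but to the *same* parity sector on disk C.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define	NDISKS	3		/* A, B data; C parity */
	#define	NDATA	(NDISKS - 1)
	#define	STRIPE	(128 * 1024)

	int
	main(void)
	{
		uint64_t offsets[] = { 0, 128 * 1024 };

		for (int i = 0; i < 2; i++) {
			uint64_t off = offsets[i];
			uint64_t chunk = off / STRIPE;	/* logical chunk */
			int data_disk = chunk % NDATA;	/* A=0, B=1 */
			uint64_t row = chunk / NDATA;	/* row on disk */

			printf("offset %7ju: data on disk %c, "
			    "parity on disk C, row %ju\n",
			    (uintmax_t)off, 'A' + data_disk,
			    (uintmax_t)row);
		}
		return (0);
	}

Both offsets land in row 0, so both writes must update the very same
parity sector - which is exactly the in-place overwrite COW cannot
cover.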
> As you can see, two independent ZFS blocks share one parity block.
> COW won't help you here; you would need to be sure that each ZFS
> transaction goes to a different (and free) RAID5 row.
>
> This is, I believe, the main reason why plain RAID5 wasn't used in
> the first place.

Exactly right. RAID-Z has different performance trade-offs than
RAID-5, but the deciding factor was correctness.

I'm really glad you're doing these experiments! It's good to know what
the trade-offs are, performance-wise, between RAID-Z and classic
RAID-5. At a minimum, it tells us what's on the table, and what we're
paying for transactional semantics.

To be honest, I'm pleased that it's only 2x. It wouldn't have
surprised me if it were Nx for an N+1 configuration. A factor of 2 is
something we can work with.

Jeff
> My question is: Is there any interest in finishing RAID5/RAID6 for
> ZFS? If there is no chance it will be integrated into ZFS at some
> point, I won't bother finishing it.

Your work is as pure an example as any of what OpenSolaris should be
about. I think there should be no problem having a new feature like
that integrated!... as long as it is the level of quality that the
community wants.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
>
> RAIDZ:
>
> 	Number of READ requests:      40000.
> 	Number of WRITE requests:     0.
> 	Number of bytes to transmit:  695678976.
> 	Number of processes:          8.
> 	Bytes per second:             1305213
> 	Requests per second:          75
>
> RAID5:
>
> 	Number of READ requests:      40000.
> 	Number of WRITE requests:     0.
> 	Number of bytes to transmit:  695678976.
> 	Number of processes:          8.
> 	Bytes per second:             2749719
> 	Requests per second:          158

I'm a bit surprised by these results. Assuming relatively large blocks
were written, RAID-Z and RAID-5 should be laid out on disk very
similarly, resulting in similar read performance.

Did you compare the I/O characteristics of both? Was the bottleneck in
the software or the hardware?

Very interesting experiment...

Adam

--
Adam Leventhal, FishWorks                  http://blogs.sun.com/ahl
On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> Hi.
>
> I've a prototype RAID5 implementation for ZFS. It only works in
> non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> performance, as I suspected that RAIDZ, because of full-stripe
> operations, doesn't work well for random reads issued by many
> processes in parallel.
>
> There is of course the write-hole problem, which can be mitigated by
> running a scrub after a power failure or system crash.

If I read your suggestion correctly, your implementation is much more
like traditional raid-5, with a read-modify-write cycle?

My understanding of the raid-z performance issue is that it requires
full-stripe reads in order to validate the checksum. So to get better
random read performance, why not simply have a separate checksum for
each chunk in the stripe? You still eliminate the raid-5 write hole
(albeit at some loss in performance, because you have to compute and
write extra checksums) but you allow multiple independent reads.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large
> blocks were written, RAID-Z and RAID-5 should be laid out on disk
> very similarly, resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The
smallest block size used was 2kB, to ensure each block is written to
all disks in the RAIDZ configuration. In the RAID5 configuration,
however, a 128kB stripe size was used, which means each block was
stored on one disk only.

Now when you read the data, RAIDZ needs to read all disks for each
block, and RAID5 needs to read only one disk for each block.

> Did you compare the I/O characteristics of both? Was the bottleneck
> in the software or the hardware?

The bottleneck was definitely the disks. The CPU was something like
96% idle. To be honest, I expected, just like Jeff, a much bigger win
in the RAID5 case.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large
> blocks were written, RAID-Z and RAID-5 should be laid out on disk
> very similarly, resulting in similar read performance.
>
> Did you compare the I/O characteristics of both? Was the bottleneck
> in the software or the hardware?

Note that Pawel wrote:

Pawel> I was using 8 processes; I/O size was a random value between
Pawel> 2kB and 32kB (with a 2kB step), and the offset was a random
Pawel> value between 0 and 10GB (also with a 2kB step).

If the dataset's record size was the default (Pawel didn't say,
right?) then the reason for the lousy read performance is clear:
RAID-Z has to read full blocks to verify the checksum, whereas RAID-5
need only read as much as is requested (assuming aligned reads, which
Pawel did seem to indicate: "2kB steps").

Peter Tribble pointed out much the same thing already.

The crucial requirement is to match the dataset record size to the
I/O size done by the application. If the app writes in bigger chunks
than it reads and you want to optimize for write performance, then
set the record size to match the write size; otherwise set the record
size to match the read size.

Where the dataset record size is not matched to the application's I/O
size, I guess we could say that RAID-Z trades off the RAID-5 write
hole for a read hole.

Nico
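A back-of-the-envelope sketch of that read amplification, assuming the
default 128kB recordsize (which, as noted above, Pawel didn't confirm)
and the benchmark's uniformly distributed 2-32kB requests:

	/*
	 * With a 128kB record and 2..32kB application reads, RAID-Z
	 * must read the whole record to verify the checksum, while
	 * RAID-5 reads only the requested range.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		const double recordsize = 128 * 1024;
		/* mean of 2,4,...,32kB uniform request sizes */
		const double avg_req = (2 + 32) / 2.0 * 1024;

		printf("bytes read per request, RAID-Z: %.0f\n",
		    recordsize);
		printf("bytes read per request, RAID-5: %.0f\n",
		    avg_req);
		printf("read amplification:             %.1fx\n",
		    recordsize / avg_req);
		return (0);
	}

Under these assumptions RAID-Z reads roughly 7.5x as many bytes per
request, though seeks rather than bandwidth dominate here, which may
be why the measured gap is only about 2x.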
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> > I've a prototype RAID5 implementation for ZFS. It only works in
> > non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> > performance, as I suspected that RAIDZ, because of full-stripe
> > operations, doesn't work well for random reads issued by many
> > processes in parallel.
> >
> > There is of course the write-hole problem, which can be mitigated
> > by running a scrub after a power failure or system crash.
>
> If I read your suggestion correctly, your implementation is much
> more like traditional raid-5, with a read-modify-write cycle?
>
> My understanding of the raid-z performance issue is that it requires
> full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and this is not the reason
why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
doesn't read the parity.

This is how RAIDZ fills the disks (follow the numbers):

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P3
	D4	D5	D6	P7
	D8	D9	D10	P11
	D12	D13	D14	P15
	D16	D17	D18	P19
	D20	D21	D22	P23

D is data, P is parity.

And RAID5 does this:

	Disk0	Disk1	Disk2	Disk3

	D0	D3	D6	P0,3,6
	D1	D4	D7	P1,4,7
	D2	D5	D8	P2,5,8
	D9	D12	D15	P9,12,15
	D10	D13	D16	P10,13,16
	D11	D14	D17	P11,14,17

As you can see, even a small block is stored on all disks in RAIDZ,
where on RAID5 a small block can be stored on one disk only.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
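To put the two diagrams in code form, a small sketch (a hypothetical
helper, assuming 512-byte sectors, 4 disks, and a 128kB RAID5 stripe
unit as in the earlier example) of how many disks one read touches in
each layout:

	#include <stdio.h>

	#define	NDISKS	4
	#define	NDATA	(NDISKS - 1)
	#define	SECTOR	512
	#define	SUNIT	(128 * 1024)	/* RAID-5 stripe unit */

	static int
	raidz_read_disks(size_t size)
	{
		/* RAID-Z spreads every block across the data columns. */
		size_t sectors = (size + SECTOR - 1) / SECTOR;
		return (sectors < NDATA ? (int)sectors : NDATA);
	}

	static int
	raid5_read_disks(size_t off, size_t size)
	{
		/* RAID-5 touches only the stripe units the range overlaps. */
		size_t first = off / SUNIT;
		size_t last = (off + size - 1) / SUNIT;
		int n = (int)(last - first + 1);
		return (n < NDATA ? n : NDATA);
	}

	int
	main(void)
	{
		for (size_t size = 2048; size <= 32768; size += 2048)
			printf("%5zu bytes: raidz reads %d disks, "
			    "raid5 reads %d disk(s)\n", size,
			    raidz_read_disks(size),
			    raid5_read_disks(0, size));
		return (0);
	}

Every 2-32kB request keeps all RAID-Z data columns busy, while the
same request lands on a single RAID-5 disk, leaving the other spindles
free to serve other processes.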
On Thu, Sep 13, 2007 at 12:56:44AM +0200, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> > My understanding of the raid-z performance issue is that it
> > requires full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read the parity.

I doubt reading the parity could cost all that much (particularly if
there's enough I/O capacity). The cost is that you have to read the
full 128KB, if a file's record size is 128KB, in order to satisfy a
2KB read. And ZFS has to read full blocks in order to verify the
checksum.

Nico
On Thu, 13 Sep 2007, Pawel Jakub Dawidek wrote:
> This is how RAIDZ fills the disks (follow the numbers):
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P3
> 	D4	D5	D6	P7
> 	D8	D9	D10	P11
> 	D12	D13	D14	P15
> 	D16	D17	D18	P19
> 	D20	D21	D22	P23
>
> D is data, P is parity.
>
> And RAID5 does this:
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D3	D6	P0,3,6
> 	D1	D4	D7	P1,4,7
> 	D2	D5	D8	P2,5,8
> 	D9	D12	D15	P9,12,15
> 	D10	D13	D16	P10,13,16
> 	D11	D14	D17	P11,14,17

Surely the above is not accurate? You're showing the parity data only
being written to Disk3. In RAID5 the parity is distributed across all
disks in the RAID5 set. What is illustrated above is RAID3.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
> Surely the above is not accurate? You're showing the parity data only
> being written to Disk3. In RAID5 the parity is distributed across all
> disks in the RAID5 set. What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are
differences in practice), but my point wasn't how the parity is
distributed :)

Ok, RAID5 once again:

	Disk0	Disk1	Disk2		Disk3

	D0	D3	D6		P0,3,6
	D1	D4	D7		P1,4,7
	D2	D5	D8		P2,5,8
	D9	D12	P9,12,15	D15
	D10	D13	P10,13,16	D16
	D11	D14	P11,14,17	D17

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
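The rotation in the corrected table follows a simple pattern; one
common way to express it is the sketch below (a left-style RAID5
rotation with three rows per stripe unit, matching the table above -
not necessarily the exact rotation the prototype uses):

	#include <stdio.h>

	#define	NDISKS		4
	#define	ROWS_PER_UNIT	3	/* rows per stripe unit above */

	int
	main(void)
	{
		for (int row = 0; row < 12; row++) {
			/* parity moves left one disk per stripe unit */
			int pdisk = (NDISKS - 1) -
			    (row / ROWS_PER_UNIT) % NDISKS;
			printf("row %2d: parity on Disk%d\n", row, pdisk);
		}
		return (0);
	}

Rows 0-2 put parity on Disk3 and rows 3-5 on Disk2, as in the table;
the pattern then continues through Disk1 and Disk0 before wrapping.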
Pawel Jakub Dawidek <pjd <at> FreeBSD.org> writes:
>
> This is how RAIDZ fills the disks (follow the numbers):
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P3
> 	D4	D5	D6	P7
> 	D8	D9	D10	P11
> 	D12	D13	D14	P15
> 	D16	D17	D18	P19
> 	D20	D21	D22	P23
>
> D is data, P is parity.

This layout assumes, of course, that large stripes have been written
to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
possible for a single logical block to span only 2 disks (for those
who don't know what I am talking about, see the "red" block occupying
LBAs D3 and E3 on page 13 of these ZFS slides [1]).

To read this logical block (and validate its checksum), only D_0 needs
to be read (LBA E3). So in this very specific case, a RAIDZ read
operation is as cheap as a RAID5 read operation. The existence of
these small stripes could explain why RAIDZ doesn't fall as far behind
RAID5 in Pawel's benchmark as one might expect...

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc
On Thu, Sep 13, 2007 at 04:58:10AM +0000, Marc Bevand wrote:
> This layout assumes, of course, that large stripes have been written
> to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
> possible for a single logical block to span only 2 disks (for those
> who don't know what I am talking about, see the "red" block occupying
> LBAs D3 and E3 on page 13 of these ZFS slides [1]).

Yes, I'm aware of that.

> To read this logical block (and validate its checksum), only D_0
> needs to be read (LBA E3). So in this very specific case, a RAIDZ
> read operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient,
for two reasons:

1. Bandwidth - writing one sector at a time? Come on.

2. Space - when you write one sector and its parity, you consume two
   sectors. You may have more than one parity column in that case,
   e.g.:

	Disk0	Disk1	Disk2	Disk3	Disk4	Disk5

	D0	P0	D1	P1	D2	P2

   In this case the space overhead is the same as in a mirror.

> [...] The existence of these small stripes could explain why RAIDZ
> doesn't fall as far behind RAID5 in Pawel's benchmark as one might
> expect...

No. As I said, the smallest block I used was 2kB, which means four
512-byte data sectors plus one 512-byte parity sector - each 2kB block
uses all 5 disks.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                        http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 9/12/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> Hmm, no. The data was organized very differently on the disks. The
> smallest block size used was 2kB, to ensure each block is written to
> all disks in the RAIDZ configuration. In the RAID5 configuration,
> however, a 128kB stripe size was used, which means each block was
> stored on one disk only.
>
> Now when you read the data, RAIDZ needs to read all disks for each
> block, and RAID5 needs to read only one disk for each block.
>
> The bottleneck was definitely the disks. The CPU was something like
> 96% idle. To be honest, I expected, just like Jeff, a much bigger win
> in the RAID5 case.

Well, it depends. In both configurations the available read bandwidth
is the same. Presumably you're expecting each disk to seek
independently and concurrently. Is the spa aware that multiple,
offset-dependent reads can be issued concurrently to the RAID-5 vdev?

James
On 9/10/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> The problem with RAID5 is that different blocks share the same
> parity, which is not the case for RAIDZ. When you write a block in
> RAIDZ, you write the data and the parity, and then you switch the
> pointer in the uberblock. For RAID5, you write the data and you need
> to update the parity, which also protects some other data. Now if
> you write the data but don't update the parity before a crash, you
> have a hole. If you update the parity before the write and then
> crash, the parity is inconsistent with the other blocks in the same
> stripe.

This is why you should consider "old" data and parity as being "live".
The old data (being overwritten) is live, as it is needed for the
parity to be consistent - and the old parity is live because it
protects the other blocks.

What IMO should be done is object-level raid: write new parity and new
data into blocks not yet used - and as the new parity also protects
the "neighbouring" data, the old parity can be freed, and after it is
no longer live the "overwritten" data block can also be freed.

Note that this is very different from traditional raid5, as it
requires intimate knowledge of the FS structure. Traditional raids
also keep parity "in line" with the data blocks it protects - but that
is not necessary if the FS can store information about where the
parity is located.

Define "live data" well enough and you're safe if you never overwrite
any of it.

> My idea was to have one sector every 1GB on each disk for a "journal"
> to keep a list of blocks being updated.

This would be called a "write intent log" or "bitmap" (as in Linux
software raid). It speeds up recovery, but doesn't protect against
write-hole problems.
Here is a different twist on your interesting scheme.

First start with writing 3 blocks and parity in a full stripe:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2

Next the application modifies D0 -> D0' and also writes other data D3,
D4. Now you have:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2
	D0'	D3	D4	P0',3,4

So file updates combine with new data into new full stripes. This is
the trivial part. Now the hard part: we have to deal with D0. D0 is
free of data content (superseded by D0'). However, it holds parity
information protecting the live data D1, D2.

If the workload updates the data in D1 and D2, the full stripe becomes
free (this is the easy part). But if D1 and D2 stay immutable for a
long time, then we can run out of pool blocks with D0 held down in a
half-freed state. So as we near full pool capacity, a scrubber would
have to walk the stripes and look for partially freed ones. Then it
would need to do a scrubbing "read/write" on D1, D2 so that they
become part of a new stripe with some other data, freeing the full
initial stripe.

-r
On 9/20/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
> Next the application modifies D0 -> D0' and also writes other data
> D3, D4. Now you have:
>
> 	Disk0	Disk1	Disk2	Disk3
>
> 	D0	D1	D2	P0,1,2
> 	D0'	D3	D4	P0',3,4
>
> But if D1 and D2 stay immutable for a long time, then we can run out
> of pool blocks with D0 held down in a half-freed state. So as we near
> full pool capacity, a scrubber would have to walk the stripes and
> look for partially freed ones. Then it would need to do a scrubbing
> "read/write" on D1, D2 so that they become part of a new stripe with
> some other data, freeing the full initial stripe.

Or, given a list of partial stripes (and sufficient cache), the next
write of D5 could be combined with D1, D2:

	Disk0	Disk1	Disk2	Disk3

	D0	D1	D2	P0,1,2
	D0'	D3	D4	P0',3,4
	D5	free	free	P5,1,2

therefore freeing D0 and P0,1,2:

	Disk0	Disk1	Disk2	Disk3

	free	D1	D2	free
	D0'	D3	D4	P0',3,4
	D5	free	free	P5,1,2

(I assumed no need for alignment.)

Performance-wise, I'm guessing it might be beneficial to "quickly"
write mirrored blocks on the disk and later combine them, freeing the
now-unneeded mirrors.
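A sketch of that combining step - hypothetical structures only, and
simplified in that it reuses a freed slot and recomputes parity in
place, where a real implementation would write the new parity
copy-on-write:

	/*
	 * Drop a new block into a freed data slot of a half-freed
	 * stripe and recompute parity over the stripe's current
	 * contents, so the old data copy and old parity become free.
	 */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define	NDATA	3		/* data columns per stripe */
	#define	CHUNK	512		/* bytes per column chunk */

	struct stripe {
		uint8_t	data[NDATA][CHUNK];
		uint8_t	parity[CHUNK];
		bool	live[NDATA];	/* slots holding live blocks */
	};

	/* XOR-recompute parity over the current column contents. */
	static void
	stripe_reparity(struct stripe *s)
	{
		memset(s->parity, 0, CHUNK);
		for (int c = 0; c < NDATA; c++)
			for (int i = 0; i < CHUNK; i++)
				s->parity[i] ^= s->data[c][i];
	}

	/* Place a new block into the first non-live slot, if any. */
	static bool
	stripe_fill(struct stripe *s, const uint8_t *newdata)
	{
		for (int c = 0; c < NDATA; c++) {
			if (!s->live[c]) {
				memcpy(s->data[c], newdata, CHUNK);
				s->live[c] = true;
				stripe_reparity(s);
				return (true);
			}
		}
		return (false);	/* stripe already full */
	}

	int
	main(void)
	{
		/* D0's slot was freed; D1 and D2 are still live. */
		struct stripe s = { .live = { false, true, true } };
		uint8_t d5[CHUNK] = { 0xab };

		if (stripe_fill(&s, d5))
			printf("D5 placed; old D0 and old parity "
			    "are now free\n");
		return (0);
	}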