Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
                  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
                  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
                  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
                  [...]
                  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0

zfs set atime=off test
zfs set recordsize=16k test
(I know...)

Now if I create one large file with filebench and simulate a random-read
workload with 1 or more threads, then the disks on the c2 and c3 controllers
get about 80% more reads than the others. This happens on both 111b and
snv_134. I would rather expect all of them to get about the same number of
IOPS.

Any idea why?

--
Robert Milkowski
http://milek.blogspot.com
Reaching into the dusty regions of my brain, I seem to recall that, since
RAID-Z does not work like a traditional RAID 5 (particularly because of its
variably sized stripes), the data may not hit all of the disks, but it will
always be redundant. I apologize for not having a reference for this
assertion, so I may be completely wrong.

I assume your hardware is recent, the controllers are on PCIe x4 buses, etc.

-Scott
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity
distribution that RAID-5 does. Instead, it relies on non-uniform stripe
widths to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

> Now if I create one large file with filebench and simulate a random-read
> workload with 1 or more threads, then the disks on the c2 and c3 controllers
> get about 80% more reads than the others. This happens on both 111b and
> snv_134. I would rather expect all of them to get about the same number of
> IOPS.
>
> Any idea why?

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
128GB.

Does it mean that for a dataset used for databases and similar environments,
where basically all blocks have a fixed size and there is no other data, all
parity information will end up on one (raidz1) or two (raidz2) specific disks?

On 23/06/2010 17:51, Adam Leventhal wrote:
> How big of a file are you making? RAID-Z does not explicitly do the parity
> distribution that RAID-5 does. Instead, it relies on non-uniform stripe
> widths to distribute IOPS.
> Does it mean that for a dataset used for databases and similar environments,
> where basically all blocks have a fixed size and there is no other data, all
> parity information will end up on one (raidz1) or two (raidz2) specific
> disks?

No. There are always smaller writes to metadata that will distribute parity.
What is the total width of your raidz1 stripe?

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
On Jun 23, 2010, at 1:48 PM, Robert Milkowski <milek at task.gda.pl> wrote:

> 128GB.
>
> Does it mean that for a dataset used for databases and similar environments,
> where basically all blocks have a fixed size and there is no other data, all
> parity information will end up on one (raidz1) or two (raidz2) specific
> disks?

What's the record size on those datasets? 8k?

-Ross
On 23/06/2010 18:50, Adam Leventhal wrote:
> No. There are always smaller writes to metadata that will distribute parity.
> What is the total width of your raidz1 stripe?

4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

--
Robert Milkowski
http://milek.blogspot.com
On 23/06/2010 19:29, Ross Walker wrote:
> What's the record size on those datasets? 8k?

16K
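For reference, here is roughly how a single 16 KB record should lay out on one
of the 4-wide raidz1 vdevs, going by a reading of vdev_raidz_asize() and
vdev_raidz_map_alloc(). Treat it as a sketch only: the 512-byte sector size,
the one-parity-sector-per-row rule, and the round-up to a multiple of
nparity+1 sectors are assumptions from memory, not verified against the
source.

    # Sketch of raidz1 geometry for one record; the rules below are assumed,
    # not verified against the actual vdev_raidz code.
    SECTOR = 512                       # assumed ashift=9 devices

    def raidz_layout(psize, ndisks, nparity):
        data = (psize + SECTOR - 1) // SECTOR        # data sectors
        rows = -(-data // (ndisks - nparity))        # rows in the stripe
        parity = nparity * rows                      # one parity sector per row
        alloc = data + parity
        pad = -alloc % (nparity + 1)                 # assumed round-up rule
        return data, parity, pad, (alloc + pad) * SECTOR

    data, parity, pad, nbytes = raidz_layout(16 * 1024, ndisks=4, nparity=1)
    print(data, parity, pad, nbytes)   # -> 32 11 1 22528

If that is right, each 16 KB record consumes 32 data sectors, 11 parity
sectors and one pad sector, the parity sits in the column where the allocation
happens to start, and a read of the record touches only the data columns. So
which disks end up busiest would depend on how the allocator places
consecutive records, which is speculation rather than a confirmed explanation
of the skew.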
On Jun 24, 2010, at 5:40 AM, Robert Milkowski <milek at task.gda.pl> wrote:

> 4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

From what I gather, each 16KB record (plus parity) is spread across the raidz
disks. This causes the total random IOPS (write AND read) of the raidz to be
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns, not random. To get good
random IO with raidz you need a zpool with X raidz vdevs, where
X = desired IOPS / IOPS of a single drive.

-Ross
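As a quick worked example of that rule of thumb (the figures below are made-up
placeholders for illustration, not numbers from this thread):

    # Hypothetical figures, purely to illustrate the vdev-count rule of thumb.
    desired_iops = 4000      # assumed target random-read IOPS for the pool
    drive_iops = 180         # assumed random IOPS of one 7200 rpm disk
    vdevs = -(-desired_iops // drive_iops)   # ceiling division
    print(vdevs, "raidz vdevs")              # -> 23 raidz vdevs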
On 24/06/2010 14:32, Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I know that, and it wasn't my question.

--
Robert Milkowski
http://milek.blogspot.com
On Thu, 24 Jun 2010, Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

Remarkably, I have yet to see mention of someone testing a raidz comprised
entirely of FLASH SSDs. This should help with the IOPS, particularly when
reading.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 24/06/2010 15:54, Bob Friesenhahn wrote:
> Remarkably, I have yet to see mention of someone testing a raidz comprised
> entirely of FLASH SSDs. This should help with the IOPS, particularly when
> reading.

I have. Briefly:

X4270, 2x quad-core 2.93GHz, 72GB RAM
OpenSolaris 2009.06 (snv_111b), ARC limited to 4GB
44x SSDs in an F5100; 4x SAS HBAs, 4x physical SAS connections to the F5100
(16x SAS channels in total), each to a different domain.

1. RAID-10 pool: 22x mirrors, each mirror across domains
   ZFS: recordsize=16k, atime=off
   filebench randomread benchmark, 16KB block size, 1-128 threads,
   128GB working set.
   Maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool: 11x 4-way raidz, each raidz vdev across domains
   ZFS: recordsize=16k, atime=off
   filebench randomread benchmark, 16KB block size, 1-128 threads,
   128GB working set.
   Maximum performance at 64-128 threads: ~34,000 ops/s
   With a ZFS recordsize of 32KB it got up to ~41,000 ops/s; larger ZFS
   record sizes produced worse results.

RAID-Z delivered about 3.3x fewer ops/s than RAID-10 here. SSDs do not make
any fundamental change: RAID-Z characteristics are basically the same whether
it is built out of SSDs or HDDs. However, SSDs could of course provide
good-enough performance even with RAID-Z, as at the end of the day it is not
about benchmarks but about your environment's requirements. A given number of
SSDs in a RAID-Z configuration can deliver the same performance as a much
greater number of disk drives in a RAID-10 configuration, and if you don't
need much space it could make sense.

--
Robert Milkowski
http://milek.blogspot.com
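Put another way, just doing the arithmetic on the figures quoted above (no new
measurements here):

    # Simple arithmetic on the numbers quoted above; nothing new is measured.
    mirror_vdevs, mirror_ops = 22, 137000
    raidz_vdevs, raidz_ops_16k, raidz_ops_32k = 11, 34000, 41000
    print(round(mirror_ops / mirror_vdevs))      # ~6227 ops/s per 2-way mirror
    print(round(raidz_ops_16k / raidz_vdevs))    # ~3091 ops/s per 4-way raidz1
    print(round(mirror_ops / raidz_ops_32k, 1))  # ~3.3x, the ratio quoted above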
On Jun 24, 2010, at 10:42 AM, Robert Milkowski <milek at task.gda.pl> wrote:

> I know that, and it wasn't my question.

Sorry, that was meant for the OP...
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> 4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I have seen statements like this repeated several times, though I haven't been
able to find an in-depth discussion of why this is the case. From what I've
gathered, every block (what is the correct term for this? zio block?) written
is spread across the whole raid-z. But in what units? Will a 4k write be split
into 512-byte writes? And in the opposite direction, does every block need to
be read fully, even if only parts of it are being requested, because the
checksum needs to be checked? Will the parity be read, too?

If this is all the case, I can see why raid-z reduces the performance of an
array effectively to that of one device w.r.t. random reads.

Thanks,
Arne
On 24/06/2010 20:52, Arne Jansen wrote:
> I have seen statements like this repeated several times, though I haven't
> been able to find an in-depth discussion of why this is the case.

http://blogs.sun.com/roch/entry/when_to_and_not_to

--
Robert Milkowski
http://milek.blogspot.com