Ralf Bertling
2008-Jun-22 08:42 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
Hi list,

as this matter pops up every now and then in posts on this list, I just want to clarify that the poor random-read performance of RaidZ (in its current implementation) is NOT something that follows from raidz-style space-efficient redundancy or from the copy-on-write design used in ZFS.

In an M-way mirrored setup of N disks you get the write performance of the worst disk and a read performance that is the sum of all disks (for streaming and random workloads; latency is not improved). Apart from the write performance, you also get very bad disk utilization from that scenario.

In RAID-Z we currently have to distinguish random reads from streaming reads:
- Write performance (with COW) is (N-M) times the worst single-disk write performance, since all writes are streaming writes by design of ZFS (which is N-M-1 times faster than mirrored).
- Streaming read performance is N times the worst read performance of a single disk (which is identical to mirrored if all disks have the same speed).
- The problem with the current implementation is that all N-M data disks in a vdev currently take part in reading even a single byte from it, which in turn limits random reads to the performance of the slowest of those N-M disks.

Now let's see if this really has to be this way (this implies no, doesn't it ;-)

When reading small blocks of data (as opposed to the streams discussed earlier), the requested data resides on a single disk, so reading it should not require sending read commands to all disks in the vdev. Without detailed knowledge of the ZFS code, I suspect the problem is that the logical block size of any ZFS operation always uses the full stripe. If true, I think this is a design error. Without it, random reads from a raid-z would be almost as fast as from mirrored data.

The theoretical disadvantages come from disks that have different speeds (probably insignificant in any real-life scenario) and the statistical probability that, by chance, a few particular random reads do in fact have to access the same disk drive to be fulfilled. (In a mirrored setup, ZFS can choose from all idle devices, whereas in RAID-Z it has to wait for the disk that holds the data to finish processing its current requests.)

Looking more closely, this effect mostly affects latency (not throughput), as incoming random read requests should be distributed evenly across all devices, and even better as the queue of requests gets longer (this would, however, require ZFS to reorder requests for maximum performance).

Since this seems to be a real issue for many ZFS users, it would be nice if someone who has more time than me to look into the code could comment on the amount of work required to boost RaidZ read performance. Doing so would shift the trade-off between read/write performance and disk utilization significantly. Obviously, if disk space (and the resulting electricity costs) does not matter compared to getting maximum read performance, you will always be best off with 3-way or even wider mirrors and a very large number of vdevs in your pool.

A further question that springs to mind is whether copies=N is also used to improve read performance. If so, you could have some read-optimized filesystems in a pool while others use maximum storage efficiency (e.g. for backups).

Regards,
ralf

--
Ralf Bertling
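To make the arithmetic above concrete, here is a back-of-envelope sketch in Python. It only restates the claims in this post; it is not derived from the ZFS code, and the disk count, parity level, and per-disk IOPS figures are invented.

    # Toy model of small random-read throughput for a single vdev; all numbers invented.
    def mirror_random_read_iops(n_disks, iops_per_disk):
        # Claim above: every disk in a mirrored setup can serve an independent read.
        return n_disks * iops_per_disk

    def raidz_random_read_iops_current(iops_per_disk):
        # Claim above: all data disks take part in every block read, so the whole
        # group delivers roughly one disk's worth of random-read IOPS.
        return iops_per_disk

    def raidz_random_read_iops_ideal(n_disks, parity, iops_per_disk):
        # Hypothetical behaviour where only the disk holding the data is read.
        return (n_disks - parity) * iops_per_disk

    n, parity, per_disk = 6, 1, 100   # e.g. 6 disks, raidz1, ~100 random IOPS per disk
    print("mirrored set :", mirror_random_read_iops(n, per_disk))              # 600
    print("raidz1 today :", raidz_random_read_iops_current(per_disk))          # 100
    print("raidz1 ideal :", raidz_random_read_iops_ideal(n, parity, per_disk)) # 500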
Bob Friesenhahn
2008-Jun-22 15:37 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
On Sun, 22 Jun 2008, Ralf Bertling wrote:

> Now let's see if this really has to be this way (this implies no, doesn't it ;-)
> When reading small blocks of data (as opposed to the streams discussed earlier),
> the requested data resides on a single disk, so reading it should not require
> sending read commands to all disks in the vdev. Without detailed knowledge of
> the ZFS code, I suspect the problem is that the logical block size of any ZFS
> operation always uses the full stripe. If true, I think this is a design error.
> Without it, random reads from a raid-z would be almost as fast as from mirrored data.

Keep in mind that ZFS checksums all data, that the checksum is stored in a different block than the data, and that if ZFS were to checksum at the stripe-segment level, a lot more checksums would need to be stored. All these extra checksums would require more data accesses, more checksum computations, and more stress on the free-block allocator, since ZFS uses copy-on-write in all cases.

Perhaps the solution is to install more RAM in the system so that the stripe is fully cached and ZFS does not need to go back to disk prior to writing an update. The need to read prior to write is clearly what kills ZFS update performance. That is why using 8K blocks helps database performance.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
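A rough way to see the checksum cost being described here, assuming for the sake of illustration a 6-disk raidz1 and a 32-byte checksum per checksummed unit (neither figure comes from this thread):

    # Count checksums needed per record if ZFS checksummed each stripe segment
    # instead of each record; sizes are assumptions for illustration only.
    CHECKSUM_BYTES = 32
    N_DISKS, PARITY = 6, 1

    def checksums_per_record(per_segment):
        data_disks = N_DISKS - PARITY
        return data_disks if per_segment else 1

    for per_segment in (False, True):
        n = checksums_per_record(per_segment)
        label = "per segment" if per_segment else "per record "
        print(label, ":", n, "checksum(s),", n * CHECKSUM_BYTES, "bytes of checksum metadata")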
Brian Hechinger
2008-Jun-22 15:43 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
On Sun, Jun 22, 2008 at 10:37:34AM -0500, Bob Friesenhahn wrote:

> Perhaps the solution is to install more RAM in the system so that the
> stripe is fully cached and ZFS does not need to go back to disk prior
> to writing an update. The need to read prior to write is clearly what
> kills ZFS update performance. That is why using 8K blocks helps
> database performance.

How much do slogs/cache disks help in this case? I'm thinking fast SSD or fast iRAM-style devices (I really wish Gigabyte would update the iRAM to SATA 3.0 and more RAM, but I keep saying that, and it keeps not happening).

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Will Murnane
2008-Jun-22 15:55 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
On Sun, Jun 22, 2008 at 15:37, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> Keep in mind that ZFS checksums all data, that the checksum is stored in a
> different block than the data, and that if ZFS were to checksum at the
> stripe-segment level, a lot more checksums would need to be stored.
> All these extra checksums would require more data accesses, more

I think the question is more "why segment in the first place?". If ZFS kept everything in recordsize blocks that reside on one disk each (or in two places, if there is mirroring going on) and made parity just another recordsize block, one could avoid the penalty of seeking every disk for every read.

The downside of this scheme would be deletes: if you actually free blocks, then the parity is useless. So you'd need to do something like keep the old, useless block around and put its neighbors in the parity group on a list of blocks to be re-paritied. Then, when new parity has been regenerated, you can actually free the block.

An advantage this would have would be changing the width of raidz/z2 groups: if another disk is added, one can mark every block as needing new parity of width N+1 and let the re-parity process do its thing. This would take a while, of course, but it would add the expandability that people have been asking for.

> Perhaps the solution is to install more RAM in the system so that the
> stripe is fully cached and ZFS does not need to go back to disk prior
> to writing an update.

I don't think the problem is that the stripe is falling out of cache, but that it costs so much to get it into memory in the first place.

Will
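A minimal sketch of the bookkeeping this proposal seems to imply, with invented class and method names and no claim that raidz works this way today: freed blocks are parked until their parity group has been recomputed without them.

    # Hypothetical deferred re-parity scheme, purely illustrative.
    class ParityGroup:
        def __init__(self, members, parity_block):
            self.members = set(members)       # ids of live data blocks in the group
            self.parity_block = parity_block  # id of the current parity block
            self.pending = set()              # freed blocks still held for old parity

        def free_block(self, block_id):
            # Don't release the block yet; the existing parity still depends on it.
            self.members.discard(block_id)
            self.pending.add(block_id)

        def reparity(self, compute_parity):
            # Once new parity covers only the remaining members, the parked
            # blocks (and the old parity) can really be freed.
            self.parity_block = compute_parity(sorted(self.members))
            freed, self.pending = self.pending, set()
            return freed

    # Usage: free a block, then let a background pass regenerate parity.
    group = ParityGroup({"b1", "b2", "b3"}, "p0")
    group.free_block("b2")
    print(group.reparity(lambda blocks: "p1"))   # -> {'b2'}: now safe to reclaim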
Bob Friesenhahn
2008-Jun-22 16:21 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
On Sun, 22 Jun 2008, Brian Hechinger wrote:

> On Sun, Jun 22, 2008 at 10:37:34AM -0500, Bob Friesenhahn wrote:
>>
>> Perhaps the solution is to install more RAM in the system so that the
>> stripe is fully cached and ZFS does not need to go back to disk prior
>> to writing an update. The need to read prior to write is clearly what
>> kills ZFS update performance. That is why using 8K blocks helps
>> database performance.
>
> How much do slogs/cache disks help in this case? I'm thinking fast SSD or
> fast iRAM-style devices (I really wish Gigabyte would update the iRAM to
> SATA 3.0 and more RAM, but I keep saying that, and it keeps not happening).

To clarify, there are really two issues. One is updating small parts of a disk block without synchronous commit, while the other is updating parts of a disk block with synchronous commit. Databases always want to sync their data. When a synchronous write is requested, the ZFS in-memory copy of that write cannot be used for other purposes until the write is reported as completed, since otherwise results could be incoherent. More memory helps quite a lot in the cases where files are updated without requesting synchronization, but it is much less useful where the data needs to be committed to disk before proceeding.

Applications which want to update ZFS blocks and go fast at the same time will take care to make sure that the I/O is aligned to the start of the ZFS block and that the I/O size is a multiple of the ZFS block size. Testing shows that performance falls off a cliff for random I/O when the available ARC cache size is too small and the write is not properly aligned or is smaller than the ZFS block size. If everything is perfectly aligned, then ZFS still goes quite fast since it has no need to read the underlying data first.

What this means for applications is that if they "own" the file, it may be worthwhile to read/write full ZFS blocks and do the final partial-block update within the application rather than force ZFS to do it. However, if a small part of the file is read and then immediately updated (i.e. a record update), ZFS does a good job of caching in that case.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
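A minimal sketch of the aligned-write approach outlined above, assuming a 128 KB recordsize and a made-up file path; the point is only that whole, record-aligned writes avoid the read-before-write.

    # Write whole, record-aligned blocks so ZFS never has to read the old data.
    # RECORDSIZE and the path are assumptions, not values from this thread.
    import os

    RECORDSIZE = 128 * 1024

    def write_full_records(fd, offset, data):
        # Caller keeps offsets and lengths on record boundaries.
        assert offset % RECORDSIZE == 0 and len(data) % RECORDSIZE == 0
        os.pwrite(fd, data, offset)

    fd = os.open("/tank/testfile", os.O_WRONLY | os.O_CREAT, 0o644)
    write_full_records(fd, 0, b"\0" * (4 * RECORDSIZE))   # fast path: no prior read needed
    os.close(fd)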
Bob Friesenhahn
2008-Jun-22 16:48 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
On Sun, 22 Jun 2008, Will Murnane wrote:

>> Perhaps the solution is to install more RAM in the system so that the
>> stripe is fully cached and ZFS does not need to go back to disk prior
>> to writing an update.
>
> I don't think the problem is that the stripe is falling out of cache,
> but that it costs so much to get it into memory in the first place.

That makes sense and is demonstrated by measurements. The following iozone Kbytes/sec throughput numbers are from a mirrored array rather than Raid-Z, but they show how sensitive ZFS becomes to block size once cache memory requirements start to exceed available memory. Since throughput is a function of record size and latency, this presentation tends to amplify the situation.

                                                     random    random      bkwd    record    stride
    reclen     write   rewrite      read    reread      read     write      read   rewrite      read
         4    367953    143777    496378    488186      6242      2521    836293    786866     30269
         8    249827    166847    621371    489279     12520      4130    929394   1508139     41568
        16    273266    160537    555350    513444     24895      6991    928915   2473915     32016
        32    293463    168727    595128    678359     48666     15831    818962   3708512     43561
        64    284213    168007    694747    514942     99565     95703    705144   3774777    270612
       128    273797    271583   1260035   1366050    187042    512312   1175683   4616660    861089
       256    273265    272916   1259814   1394034    250743    480186    219927   4708927    587602
       512    260630    262145    713797    743914    313429    535920    343209   2603492    583120

Clearly random-read and random-write suffer the most. Since sub-block updates cause ZFS to have to read the existing block, random-write performance becomes bottlenecked by random-read performance. When the write is aligned and a multiple of the ZFS block size, then ZFS does not care what is already on disk and writes very quickly. Notice that in the above results, random write became much faster than sequential write.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
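One way to read the table: the ratio of random-write to random-read throughput stays well below 1 for small record lengths and only climbs past 1 once the I/O size approaches the ZFS recordsize, where writes no longer need to read the old block first (the exact recordsize used for this run is not stated above, so that interpretation is an assumption).

    # Random-write / random-read ratio taken from the table above (KB/s values).
    rows = {4: (6242, 2521), 8: (12520, 4130), 16: (24895, 6991), 32: (48666, 15831),
            64: (99565, 95703), 128: (187042, 512312), 256: (250743, 480186),
            512: (313429, 535920)}
    for reclen, (rand_read, rand_write) in rows.items():
        print(f"{reclen:4d} KB  random write / random read = {rand_write / rand_read:5.2f}")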
Richard Elling
2008-Jun-23 20:50 UTC
[zfs-discuss] ZFS-Performance: Raid-Z vs. Raid5/6 vs. mirrored
Ralf Bertling wrote:
> Hi list,
> as this matter pops up every now and then in posts on this list, I just
> want to clarify that the poor random-read performance of RaidZ (in its
> current implementation) is NOT something that follows from raidz-style
> space-efficient redundancy or from the copy-on-write design used in ZFS.
>
> In an M-way mirrored setup of N disks you get the write performance of
> the worst disk and a read performance that is the sum of all disks
> (for streaming and random workloads; latency is not improved).
> Apart from the write performance, you also get very bad disk utilization
> from that scenario.

I beg to differ with "very bad disk utilization." IMHO you get perfect disk utilization for M-way redundancy :-)

> In RAID-Z we currently have to distinguish random reads from streaming reads:
> - Write performance (with COW) is (N-M) times the worst single-disk write
>   performance, since all writes are streaming writes by design of ZFS
>   (which is N-M-1 times faster than mirrored).
> - Streaming read performance is N times the worst read performance of a
>   single disk (which is identical to mirrored if all disks have the same speed).
> - The problem with the current implementation is that all N-M data disks in
>   a vdev currently take part in reading even a single byte from it, which in
>   turn limits random reads to the performance of the slowest of those N-M disks.

You will not be able to predict real-world write or sequential-read performance with this simple analysis because there are many caches involved. The caching effects will dominate in many cases. ZFS actually works well with write caches, so it will be doubly difficult to predict write performance. You can predict small, random-read workload performance, though, because you can largely discount the caching effects for most scenarios, especially JBODs.

> Now let's see if this really has to be this way (this implies no, doesn't it ;-)
> When reading small blocks of data (as opposed to the streams discussed earlier),
> the requested data resides on a single disk, so reading it should not require
> sending read commands to all disks in the vdev. Without detailed knowledge of
> the ZFS code, I suspect the problem is that the logical block size of any ZFS
> operation always uses the full stripe. If true, I think this is a design error.

No, the reason is that the block is checksummed and we check for errors upon read by verifying the checksum. If you search the zfs-discuss archives you will find that this topic arises every 6 months or so. Here is a more interesting thread on the subject, dated November 2006:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-November/035711.html

You will also note that for fixed-record-length workloads, we tend to recommend that the application blocksize be matched to the ZFS recordsize. This will improve efficiency for reads, in general.

> Without it, random reads from a raid-z would be almost as fast as from
> mirrored data.
> The theoretical disadvantages come from disks that have different speeds
> (probably insignificant in any real-life scenario) and the statistical
> probability that, by chance, a few particular random reads do in fact have
> to access the same disk drive to be fulfilled. (In a mirrored setup, ZFS
> can choose from all idle devices, whereas in RAID-Z it has to wait for the
> disk that holds the data to finish processing its current requests.)
> Looking more closely, this effect mostly affects latency (not throughput),
> as incoming random read requests should be distributed evenly across all
> devices, and even better as the queue of requests gets longer (this would,
> however, require ZFS to reorder requests for maximum performance).

ZFS does re-order I/O. Array controllers re-order the re-ordered I/O. Disks then re-order I/O, just to make sure it was re-ordered again. So it is also difficult to develop meaningful models of disk performance in these complex systems.

> Since this seems to be a real issue for many ZFS users, it would be nice
> if someone who has more time than me to look into the code could comment
> on the amount of work required to boost RaidZ read performance.

Periodically, someone offers to do this... but I haven't seen an implementation.

> Doing so would shift the trade-off between read/write performance and
> disk utilization significantly.
> Obviously, if disk space (and the resulting electricity costs) does not
> matter compared to getting maximum read performance, you will always be
> best off with 3-way or even wider mirrors and a very large number of
> vdevs in your pool.

Space, performance, reliability: pick two.

<sidebar>
The ZFS checksum has proven to be very effective at identifying data corruption in systems. In a traditional RAID-5 implementation, like SVM, the data is assumed to be correct if the read operation returned without an error. If you try to make SVM more reliable by adding a checksum, then you will end up at approximately the same place ZFS is: by distrusting the hardware you take a performance penalty, but improve your data reliability.
</sidebar>

> A further question that springs to mind is whether copies=N is also used
> to improve read performance.

I have not measured copies=N performance changes, but I do not expect them to change the read efficiency. You will still need to read the entire block to calculate the checksum.

> If so, you could have some read-optimized filesystems in a pool while
> others use maximum storage efficiency (e.g. for backups).

Hmmm... ok, so how does a small, random-read workload requirement come from a backup system implementation? I would expect backups to be single-threaded, sequential workloads. For example, many people use VTLs with ZFS:
http://www.sun.com/storagetek/tape_storage/tape_virtualization/vtl_value/features.xml

-- richard
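The point that the whole block must be read back in order to verify its checksum can be illustrated with a toy example; the hash choice, record size, and four-way split below are arbitrary and are not ZFS internals.

    # The checksum covers the record as a unit, so verifying any part of it
    # means reassembling every segment first.  Purely illustrative.
    import hashlib

    def verify_record(segments, expected_digest):
        record = b"".join(segments)                  # all segments are needed
        return hashlib.sha256(record).digest() == expected_digest

    record = bytes(range(256)) * 512                 # a made-up 128 KB "record"
    digest = hashlib.sha256(record).digest()
    segments = [record[i * 32768:(i + 1) * 32768] for i in range(4)]   # 4 "disks"
    print(verify_record(segments, digest))           # True, after reading all of them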