Adam Lindsay
2007-Apr-18 14:47 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seem like they *might* be made to do what we want (saturate a GigE link), but not without some tuning....

Am I right in my understanding of relling's small, random read model?
For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.

Clearly, there are elements of the model that don't apply to our sustained reads/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?

I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?

Finally, some requirements I have in speccing up this server:
My requirements:
. Saturate a 1GigE link for sustained reads _and_ writes
  ... (long story... let's just imagine uncompressed HD video)
. Do it cheaply
My strong desires:
. ZFS for its reliability, redundancy, flexibility, and ease of use
. Maximise the amount of usable space
My resources:
. a server with 16x 500GB SATA drives usable for RAID
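To make sure I'm reading that model right, here's the back-of-the-envelope sketch I've been using (in Python; the per-spindle IOPS figure and the helper names are my own assumptions for illustration, not anything taken from relling's data):

    # Sketch of the small, random read model as I understand it.
    # DISK_IOPS is an assumed figure for one 7,200 rpm SATA drive,
    # not a measured one.
    DISK_IOPS = 80  # assumed small-random-read IOPS per spindle

    def mirror_model(n_disks, disks_per_mirror=2):
        sets = n_disks // disks_per_mirror
        read_iops = n_disks * DISK_IOPS   # reads spread over every disk
        write_iops = sets * DISK_IOPS     # writes limited by mirror sets
        return read_iops, write_iops

    def raidz_model(n_disks, disks_per_set=3):
        sets = n_disks // disks_per_set
        read_iops = sets * DISK_IOPS      # each raidz vdev ~ one disk of IOPS
        write_iops = sets * DISK_IOPS     # roughly the same for writes
        return read_iops, write_iops

    print("8x 2-disk mirrors :", mirror_model(16))
    print("5x 3-disk raidz   :", raidz_model(15))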
Bart Smaalders
2007-Apr-18 17:30 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seem like they *might* be made to do what we want (saturate a GigE link), but not without some tuning....
>
> Am I right in my understanding of relling's small, random read model?
> For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
> For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained reads/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?
>
> I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?
>
> Finally, some requirements I have in speccing up this server:
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   ... (long story... let's just imagine uncompressed HD video)
> . Do it cheaply
> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

What you need to know is what part of your workload is random reads. This will directly determine the number of spindles required. Otherwise, if your workload is sequential reads or writes, you can pretty much just use an average value for disk throughput.... With your drives and adequate CPU, you'll have absolutely no problems _melting_ a 1 Gb net. You want to think about how many disk failures you want to handle before things go south... there's always a tension between reliability, storage, and performance.

Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each set, you get 12 drives' worth of streaming IO (read or write). That will be about 500 MB/sec, rather more than you can get through a 1 Gb net. That's the aggregate bandwidth; you should be able to both sink and source data at 1 Gb/sec w/o any difficulties at all.

If you do a lot of random reads, however, that config will behave like 2 disks in terms of IOPS. To do lots of IOPS, you want to be striped across lots of 2-disk mirror pairs.

My guess is if you're doing video, you're doing lots of streaming IO (e.g. you may be reading 20 files at once, but those files are all being read sequentially). If that's the case, ZFS can do lots of clever prefetching.... On the write side, ZFS, due to its COW behavior, will handle both random and sequential writes pretty much the same way.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
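Rough arithmetic behind that, if it helps (the per-spindle streaming and IOPS figures below are ballpark assumptions for a 7,200 rpm 500GB SATA drive, not measurements):

    # Streaming bandwidth vs. random IOPS for the 2 x (6+2) raidz2
    # layout, compared with striped mirror pairs.  Per-spindle figures
    # are assumptions, not data-sheet or measured numbers.
    STREAM_MB_S = 42   # assumed sustained media rate per spindle
    DISK_IOPS   = 80   # assumed small-random-read IOPS per spindle

    def raidz2_stripe(n_sets=2, disks_per_set=8, parity=2):
        data_disks = n_sets * (disks_per_set - parity)
        streaming = data_disks * STREAM_MB_S  # all data spindles stream together
        random_iops = n_sets * DISK_IOPS      # each raidz2 vdev ~ one disk of IOPS
        return streaming, random_iops

    def mirror_stripe(n_pairs=8):
        streaming = n_pairs * STREAM_MB_S     # one side of each mirror on writes
        random_iops = n_pairs * 2 * DISK_IOPS # reads can hit either side
        return streaming, random_iops

    print("2 x (6+2) raidz2 :", raidz2_stripe())  # ~500 MB/s, ~160 IOPS
    print("8 x 2-way mirror :", mirror_stripe())  # ~340 MB/s, ~1280 IOPS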
Richard Elling
2007-Apr-18 17:50 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
so much data, so little time... :-)

Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seem like they *might* be made to do what we want (saturate a GigE link), but not without some tuning....
>
> Am I right in my understanding of relling's small, random read model?
> For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
> For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained reads/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?

I have a model for the disk/media bandwidth. In this model, the bandwidth limit is the media speed of the disk, as determined by the disk vendor's data sheet. I then apply the RAID configuration to determine the range of maximum, sustainable, logical data, read, media bandwidth (whew! :-). For example, consider an X4500 with 6 Hitachi E7K500 (500 GByte) disks.

    config              min (MBytes/s)   max (MBytes/s)
    ----------------------------------------------------
    RAIDZ2 (4d+2p)            124              259
    RAID1+0 (2d * 3)          186              389

This will give you a sense of the maximum media bandwidth capabilities, assuming you will blow by all caches. But this does not identify bottlenecks. We know that channels, controllers, memory, network, and CPU bottlenecks can and will impact actual performance, at least for large configs. Modeling these bottlenecks is possible, but will require more work in the tool. If you know the hardware topology, you can do a back-of-the-napkin analysis, too.

> I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?
>
> Finally, some requirements I have in speccing up this server:
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   ... (long story... let's just imagine uncompressed HD video)

This shouldn't be too hard, but you'll need a bunch of disks. In the above example, the Hitachi E7K500 is 7,200 rpm, 3.5". If you want to blaze, then the Seagate Savvio 2.5", 15 krpm disk should be able to do something like 60-95 MBytes/s sustained (I'm speculating; the last time I checked, they hadn't published the data sheet yet).

> . Do it cheaply

Fast disks aren't inexpensive :-(.

> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

That should work, at least as far as the disk media bandwidth requirements. You'll need to make sure that you have plenty of CPU power to drive the rest of the system.

-- richard
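If you want to play with the numbers yourself, a minimal sketch of the media-bandwidth part of the model looks like this. The per-disk inner/outer rates here are back-solved from the table above (roughly 31 and 65 MBytes/s for the E7K500), not copied from Hitachi's data sheet:

    # Sketch of the media-bandwidth model described above.  Per-disk
    # rates are back-solved from the table (124/4 and 259/4 MBytes/s),
    # i.e. assumptions for illustration.
    E7K500_MIN = 31.0   # assumed MBytes/s at the innermost zone
    E7K500_MAX = 64.8   # assumed MBytes/s at the outermost zone

    def media_bw(data_disks, per_disk_min=E7K500_MIN, per_disk_max=E7K500_MAX):
        """Range of sustainable logical read bandwidth across the data spindles."""
        return data_disks * per_disk_min, data_disks * per_disk_max

    # 6 disks as RAIDZ2 (4 data + 2 parity): 4 data spindles
    print("RAIDZ2 (4d+2p)  :", media_bw(4))   # ~(124, 259)
    # 6 disks as RAID1+0 (3 mirror pairs): reads can use all 6 spindles
    print("RAID1+0 (2d * 3):", media_bw(6))   # ~(186, 389)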
Bart Smaalders
2007-Apr-18 21:11 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Okay, the way you say it, it sounds like a good thing. I misunderstood the performance ramifications of COW and ZFS's opportunistic write locations, and came up with a much more pessimistic guess that it would approach random writes. As it is, I have upper (number of data spindles) and lower (number of disk sets) bounds to deal with. I suppose the available caching memory is what controls the resilience to the demands of random reads?

W/ that many drives (16), if you hit in RAM the reads are not really random :-), or they span only a tiny fraction of the available disk space.

Are you reading and writing the same file at the same time? Your cache hit rate will be much better then....

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
Adam Lindsay
2007-Apr-18 21:22 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Hello Bart,

Thanks for the answers...

Bart Smaalders wrote:
>> Clearly, there are elements of the model that don't apply to our sustained reads/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?
>> I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?
>
> What you need to know is what part of your workload is random reads. This will directly determine the number of spindles required. Otherwise, if your workload is sequential reads or writes, you can pretty much just use an average value for disk throughput.... With your drives and adequate CPU, you'll have absolutely no problems _melting_ a 1 Gb net. You want to think about how many disk failures you want to handle before things go south... there's always a tension between reliability, storage, and performance.

Absolutely. I've been thinking about that a fair bit (strongly aided by blogs.sun.com and the list archives). This server is for research purposes, so it will be tested with various workflows at different times, but most will be streaming IO. The most demanding imagined is that real-time uncompressed HD write. And as it'll be for research, popping in and out of different scenarios, long-life reliability isn't a major issue. Some resilience is always helpful to help offset my "cheap" criterion.

> Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each set, you get 12 drives' worth of streaming IO (read or write). That will be about 500 MB/sec, rather more than you can get through a 1 Gb net. That's the aggregate bandwidth; you should be able to both sink and source data at 1 Gb/sec w/o any difficulties at all.
>
> If you do a lot of random reads, however, that config will behave like 2 disks in terms of IOPS. To do lots of IOPS, you want to be striped across lots of 2-disk mirror pairs.
>
> My guess is if you're doing video, you're doing lots of streaming IO (e.g. you may be reading 20 files at once, but those files are all being read sequentially). If that's the case, ZFS can do lots of clever prefetching.... On the write side, ZFS, due to its COW behavior, will handle both random and sequential writes pretty much the same way.

Okay, the way you say it, it sounds like a good thing. I misunderstood the performance ramifications of COW and ZFS's opportunistic write locations, and came up with a much more pessimistic guess that it would approach random writes. As it is, I have upper (number of data spindles) and lower (number of disk sets) bounds to deal with. I suppose the available caching memory is what controls the resilience to the demands of random reads?

Thanks,
adam
Adam Lindsay
2007-Apr-18 21:30 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Bart Smaalders wrote:
> Adam Lindsay wrote:
>> Okay, the way you say it, it sounds like a good thing. I misunderstood the performance ramifications of COW and ZFS's opportunistic write locations, and came up with a much more pessimistic guess that it would approach random writes. As it is, I have upper (number of data spindles) and lower (number of disk sets) bounds to deal with. I suppose the available caching memory is what controls the resilience to the demands of random reads?
>
> W/ that many drives (16), if you hit in RAM the reads are not really random :-), or they span only a tiny fraction of the available disk space.

Clearly I hadn't thought that comment through. :) I think my mental model included imagined bottlenecks elsewhere in the system, but I haven't got to discussing those yet.

> Are you reading and writing the same file at the same time? Your cache hit rate will be much better then....

Not in the general case. Hmm, but there are some scenarios with multimedia caching boxes, so that could be interesting to leverage eventually.

bedankt,
adam
Bart Smaalders
2007-Apr-18 21:43 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Bart Smaalders wrote:
>> Adam Lindsay wrote:
>>> Okay, the way you say it, it sounds like a good thing. I misunderstood the performance ramifications of COW and ZFS's opportunistic write locations, and came up with a much more pessimistic guess that it would approach random writes. As it is, I have upper (number of data spindles) and lower (number of disk sets) bounds to deal with. I suppose the available caching memory is what controls the resilience to the demands of random reads?
>>
>> W/ that many drives (16), if you hit in RAM the reads are not really random :-), or they span only a tiny fraction of the available disk space.
>
> Clearly I hadn't thought that comment through. :) I think my mental model included imagined bottlenecks elsewhere in the system, but I haven't got to discussing those yet.

Hmmm... that _was_ prob. more opaque than necessary. What I meant was that you've got something on the order of 5 TB or better of disk space; assuming uniformly distributed reads of data and 4 GB of RAM, the odds of hitting in the cache are essentially zero wrt performance.

>> Are you reading and writing the same file at the same time? Your cache hit rate will be much better then....
>
> Not in the general case. Hmm, but there are some scenarios with multimedia caching boxes, so that could be interesting to leverage eventually.
>
> bedankt,
> adam

graag gedaan.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
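The arithmetic, for what it's worth (just my own round numbers, illustrating the point):

    # Best-case odds of a uniformly random read landing in cache,
    # using the round numbers above (nothing measured).
    ram_gb = 4
    disk_tb = 5
    hit_rate = ram_gb / (disk_tb * 1024.0)   # GB of cache over GB of data
    print(f"best-case cache hit rate ~ {hit_rate:.4%}")   # ~0.08%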
Adam Lindsay
2007-Apr-18 22:15 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Thanks, Richard, for your comments.

Richard Elling wrote:
> so much data, so little time... :-)

:) indeed.

> Adam Lindsay wrote:
>> Clearly, there are elements of the model that don't apply to our sustained reads/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?
>
> I have a model for the disk/media bandwidth. In this model, the bandwidth limit is the media speed of the disk, as determined by the disk vendor's data sheet. I then apply the RAID configuration to determine the range of maximum, sustainable, logical data, read, media bandwidth (whew! :-). For example, consider an X4500 with 6 Hitachi E7K500 (500 GByte) disks.
>
>    config              min (MBytes/s)   max (MBytes/s)
>    ----------------------------------------------------
>    RAIDZ2 (4d+2p)            124              259
>    RAID1+0 (2d * 3)          186              389

And, extrapolating (and by implication from Bart's comments), this scales linearly as you add data (non-parity) spindles? Even the minimum figures are much more inviting than what I was guessing from the random read figures.

> This will give you a sense of the maximum media bandwidth capabilities, assuming you will blow by all caches. But this does not identify bottlenecks.

Indeed, and you quite nicely get into the other questions I had regarding my proposed server: what bottlenecks am I going to run into, practically? I suspect it's best off in another thread, with your indulgence.

> We know that channels, controllers, memory, network, and CPU bottlenecks can and will impact actual performance, at least for large configs. Modeling these bottlenecks is possible, but will require more work in the tool. If you know the hardware topology, you can do a back-of-the-napkin analysis, too.
>
>> . Saturate a 1GigE link for sustained reads _and_ writes
>>   ... (long story... let's just imagine uncompressed HD video)
>
> This shouldn't be too hard, but you'll need a bunch of disks.

Okay, so I won't be shy about aspiring to saturate an aggregated 2x GigE link. :)

> In the above example, the Hitachi E7K500 is 7,200 rpm, 3.5". If you want to blaze, then the Seagate Savvio 2.5", 15 krpm disk should be able to do something like 60-95 MBytes/s sustained (I'm speculating; the last time I checked, they hadn't published the data sheet yet).
>
>> . Do it cheaply
>
> Fast disks aren't inexpensive :-(.

Indeed. That's why I want to scale out to lots of spindles, and was hoping ZFS would help make up for the other failings of using the cheap stuff.

>> My strong desires:
>> . ZFS for its reliability, redundancy, flexibility, and ease of use
>> . Maximise the amount of usable space
>> My resources:
>> . a server with 16x 500GB SATA drives usable for RAID
>
> That should work, at least as far as the disk media bandwidth requirements. You'll need to make sure that you have plenty of CPU power to drive the rest of the system.

That's the first decision I made when speccing out the system, really: the local vendor (experienced with running Linux on this chassis) originally proposed two fast single-core Opterons. I suggested two slower (2.2 GHz) dual-core ones instead.

Cheers,
adam
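To check my extrapolation, here's the same per-disk arithmetic scaled to my 16 drives. The per-spindle figures are back-solved from your table (roughly 31 and 65 MBytes/s for the E7K500), and the layouts are ones I picked myself, so treat the numbers as a sketch rather than a prediction:

    # Extrapolating the media-bandwidth model to 16 drives, using
    # per-disk rates back-solved from the table above (assumptions,
    # not data-sheet figures), against 1-2 aggregated GigE links.
    PER_DISK_MIN, PER_DISK_MAX = 31.0, 64.8   # MBytes/s, inner/outer zone
    GIGE_MB_S = 125                           # ~1 Gbit/s of payload

    layouts = {
        "5 x 3-disk raidz (2d+1p)": 5 * 2,    # data spindles per layout
        "2 x 8-disk raidz2 (6d+2p)": 2 * 6,
        "8 x 2-way mirror (reads)": 16,       # reads can use both sides
    }

    for name, data_disks in layouts.items():
        lo, hi = data_disks * PER_DISK_MIN, data_disks * PER_DISK_MAX
        print(f"{name:27s} {lo:5.0f} - {hi:5.0f} MB/s "
              f"(~{lo / GIGE_MB_S:.1f} - {hi / GIGE_MB_S:.1f} GigE links)")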