Adam Lindsay
2007-Apr-18  14:47 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seems like it *might* be made to do what we want (saturate a GigE link), but not without some tuning...

Am I right in my understanding of relling's small, random read model?
For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.

Clearly, there are elements of the model that don't apply to our sustained read/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?

I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?

Finally, some requirements I have in speccing up this server:
My requirements:
. Saturate a 1GigE link for sustained reads _and_ writes
  ... (long story... let's just imagine uncompressed HD video)
. Do it cheaply
My strong desires:
. ZFS for its reliability, redundancy, flexibility, and ease of use
. Maximise the amount of usable space
My resources:
. a server with 16x 500GB SATA drives usable for RAID

This message posted from opensolaris.org
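The small, random I/O model as summarised above can be sketched in a few lines. This is only an illustration of the proportionality rules stated in the message, expressed in "single-disk equivalents" rather than real IOPS; the function name and the candidate layouts for a 16-disk chassis are assumptions made for the example, not figures from Richard's tool.

```python
def random_io_scaling(layout: str, vdevs: int, disks_per_vdev: int) -> dict:
    """Scale factors for small random I/O, in single-disk equivalents.

    Rules as stated in the message above:
      mirror : reads scale with total disks, writes with number of mirror sets
      raidz* : reads and writes both scale with number of RAID sets
    """
    if layout == "mirror":
        return {"read": vdevs * disks_per_vdev, "write": vdevs}
    if layout in ("raidz", "raidz2"):
        return {"read": vdevs, "write": vdevs}
    raise ValueError(f"unknown layout: {layout}")


# Hypothetical ways to carve up the 16 drives (the raidz case leaves one spare).
candidates = {
    "8 x 2-way mirror": ("mirror", 8, 2),
    "5 x 3-disk raidz": ("raidz", 5, 3),
    "2 x 8-disk raidz2": ("raidz2", 2, 8),
}

for name, args in candidates.items():
    print(f"{name:18s} {random_io_scaling(*args)}")
```

Under these rules the mirrored layout is far ahead for random reads, which is why the distinction between random and streaming workloads matters so much for this sizing question.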
Bart Smaalders
2007-Apr-18  17:30 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seems like it *might* be made to do what we want (saturate a GigE link), but not without some tuning...
>
> Am I right in my understanding of relling's small, random read model?
> For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
> For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained read/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?
>
> I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?
>
> Finally, some requirements I have in speccing up this server:
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   ... (long story... let's just imagine uncompressed HD video)
> . Do it cheaply
> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

What you need to know is what part of your workload is random reads. This will directly determine the number of spindles required. Otherwise, if your workload is sequential reads or writes, you can pretty much just use an average value for disk throughput... with your drives and adequate CPU, you'll have absolutely no problems _melting_ a 1 Gb net. You want to think about how many disk failures you want to handle before things go south... there's always a tension between reliability and storage and performance.

Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each set, you get 12 drives' worth of streaming IO (read or write). That will be about 500 MB/sec, rather more than you can get through a 1 Gb net. That's the aggregate bandwidth; you should be able to both sink and source data at 1 Gb/sec w/o any difficulties at all.

If you do a lot of random reads, however, that config will behave like 2 disks in terms of IOPS. To do lots of IOPS, you want to be striped across lots of 2-disk mirror pairs.

My guess is if you're doing video, you're doing lots of streaming IO (e.g. you may be reading 20 files at once, but those files are all being read sequentially). If that's the case, ZFS can do lots of clever prefetching... on the write side, ZFS, due to its COW behavior, will just handle both random and sequential writes pretty much the same way.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
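The streaming estimate in this reply is simple enough to check by hand. The sketch below redoes that arithmetic under two stated assumptions: the per-disk rate of roughly 42 MB/s is just what "about 500 MB/sec" across 12 data drives implies (it is not a measured figure), and the GigE limit is taken as the raw 125 MB/s line rate with protocol overhead ignored.

```python
GIGE_MB_PER_S = 1000 / 8  # raw 1 Gb/s line rate in MB/s; ignores protocol overhead


def streaming_mb_per_s(vdevs: int, disks_per_vdev: int,
                       parity_per_vdev: int, per_disk_mb_per_s: float) -> float:
    """Aggregate streaming bandwidth: only the data (non-parity) disks count."""
    data_disks = vdevs * (disks_per_vdev - parity_per_vdev)
    return data_disks * per_disk_mb_per_s


# Assumption: average per-disk rate implied by "about 500 MB/sec" over 12 drives.
per_disk = 500 / 12

# 2 striped raidz2 sets of 6+2 drives each, as proposed in the message.
pool = streaming_mb_per_s(vdevs=2, disks_per_vdev=8, parity_per_vdev=2,
                          per_disk_mb_per_s=per_disk)
headroom = pool / GIGE_MB_PER_S

print(f"pool streaming estimate: {pool:.0f} MB/s")
print(f"GigE line rate:          {GIGE_MB_PER_S:.0f} MB/s")
print(f"headroom over one link:  {headroom:.1f}x")
```

Even if the real per-disk rate is well below the implied average, the margin over a single GigE link is wide, which is the point being made.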
Richard Elling
2007-Apr-18  17:50 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
so much data, so little time... :-)

Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server, and after reading Richard Elling's writings on the matter, I'm now left wondering if it'll have the performance we expect of such a server. Looking at his figures, 5x 3-disk RAIDZ sets seems like it *might* be made to do what we want (saturate a GigE link), but not without some tuning...
>
> Am I right in my understanding of relling's small, random read model?
> For mirrored configurations, read performance is proportional to the number of disks. Write performance is proportional to the number of mirror sets.
> For parity configurations, read performance is proportional to the number of RAID sets. Write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained read/writes, so does anyone have any guidance (theoretical or empirical) on what we could expect in that arena?

I have a model for the disk/media bandwidth. In this model, the bandwidth limit is the media speed of the disk, as determined by the disk vendor's data sheet. I then apply the RAID configuration to determine the range of maximum, sustainable, logical data, read, media bandwidth (whew! :-). For example, consider an X4500 with 6 Hitachi E7K500 (500 GByte) disks.

config               min (MBytes/s)   max (MBytes/s)
--------------------------------------------------
RAIDZ2 (4d+2p)            124              259
RAID1+0 (2d * 3)          186              389

This will give you a sense of the maximum media bandwidth capabilities, assuming you will blow by all caches. But this does not identify bottlenecks.

We know that channel, controller, memory, network, and CPU bottlenecks can and will impact actual performance, at least for large configs. Modeling these bottlenecks is possible, but will require more work in the tool. If you know the hardware topology, you can do a back-of-the-napkin analysis, too.

> I've seen some references to a different ZFS mode of operation for sustained and/or contiguous transfers. What should I know about them?
>
> Finally, some requirements I have in speccing up this server:
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   ... (long story... let's just imagine uncompressed HD video)

This shouldn't be too hard, but you'll need a bunch of disks. In the above example, the Hitachi E7K500 is a 7,200 rpm, 3.5" disk. If you want to blaze, then the Seagate Savvio 2.5", 15k rpm disk should be able to do something like 60-95 MBytes/s sustained (I'm speculating; the last time I checked, they hadn't published the data sheet yet).

> . Do it cheaply

Fast disks aren't inexpensive :-(.

> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

That should work, at least as far as the disk media bandwidth requirements. You'll need to make sure that you have plenty of CPU power to drive the rest of the system.

-- richard
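The two table rows above are consistent with a very simple rule: multiply a per-disk media rate by the number of disks that can serve data reads. The sketch below reproduces the table on that basis. The per-disk figures (31 and 64.8 MBytes/s) are not quoted anywhere in the thread; they are inferred by dividing the table values by the disk counts, and the rule that every disk in a mirror serves reads is likewise an inference from the numbers, so treat both as assumptions rather than as the actual tool's logic.

```python
# Assumption: per-disk media rates inferred from the table (124 / 4 and 389 / 6).
PER_DISK_MIN_MB_S = 31.0
PER_DISK_MAX_MB_S = 64.8


def read_bandwidth_range(layout: str, data_disks: int, redundancy_disks: int):
    """Return (min, max) sustained read media bandwidth in MBytes/s.

    Assumed rule: in a mirror every disk can serve reads; in a parity
    layout only the data disks' worth of bandwidth is logical data.
    """
    if layout == "mirror":
        readers = data_disks + redundancy_disks
    else:
        readers = data_disks
    return (round(readers * PER_DISK_MIN_MB_S),
            round(readers * PER_DISK_MAX_MB_S))


# The two 6-disk configurations from the table.
raidz2 = read_bandwidth_range("raidz2", data_disks=4, redundancy_disks=2)
raid10 = read_bandwidth_range("mirror", data_disks=3, redundancy_disks=3)

print("RAIDZ2 (4d+2p)  ", raidz2)
print("RAID1+0 (2d * 3)", raid10)
```

If the same rule holds for larger pools, the estimate grows linearly with the number of data disks, but that extrapolation says nothing about controller, memory, network, or CPU limits.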
Bart Smaalders
2007-Apr-18  21:11 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Okay, the way you say it, it sounds like a good thing. I misunderstood
> the performance ramifications of COW and ZFS's opportunistic write
> locations, and came up with a much more pessimistic guess that it would
> approach random writes. As it is, I have upper (number of data spindles)
> and lower (number of disk sets) bounds to deal with. I suppose the
> available caching memory is what controls the resilience to the demands
> of random reads?

W/ that many drives (16), if you hit in RAM the reads are not really random :-), or they span only a tiny fraction of the available disk space.

Are you reading and writing the same file at the same time? Your cache hit rate will be much better then...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
Adam Lindsay
2007-Apr-18  21:22 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Hello Bart,

Thanks for the answers...

Bart Smaalders wrote:
>> Clearly, there are elements of the model that don't apply to our
>> sustained read/writes, so does anyone have any guidance (theoretical
>> or empirical) on what we could expect in that arena?
>> I've seen some references to a different ZFS mode of operation for
>> sustained and/or contiguous transfers. What should I know about them?
>
> What you need to know is what part of your workload is
> random reads. This will directly determine the number
> of spindles required. Otherwise, if your workload is
> sequential reads or writes, you can pretty much just use
> an average value for disk throughput... with your drives
> and adequate CPU, you'll have absolutely no problems
> _melting_ a 1 Gb net. You want to think about how many
> disk failures you want to handle before things go south...
> there's always a tension between reliability and storage
> and performance.

Absolutely. I've been thinking about that a fair bit (strongly aided by blogs.sun.com and the list archives).

This server is for research purposes, so it will be tested with various workflows at different times, but most will be streaming IO. The most demanding one imagined is that real-time uncompressed HD write. And as it'll be for research, popping in and out of different scenarios, long-life reliability isn't a major issue. Some resilience is always helpful to offset my "cheap" criterion.

> Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each
> set, you get 12 drives' worth of streaming IO (read or write).
> That will be about 500 MB/sec, rather more than you can get
> through a 1 Gb net. That's the aggregate bandwidth; you should
> be able to both sink and source data at 1 Gb/sec w/o any difficulties
> at all.
>
> If you do a lot of random reads, however, that config will
> behave like 2 disks in terms of IOPS. To do lots of IOPS,
> you want to be striped across lots of 2-disk mirror pairs.
>
> My guess is if you're doing video, you're doing lots of
> streaming IO (e.g. you may be reading 20 files at once, but
> those files are all being read sequentially). If that's
> the case, ZFS can do lots of clever prefetching... on
> the write side, ZFS, due to its COW behavior, will just
> handle both random and sequential writes pretty
> much the same way.

Okay, the way you say it, it sounds like a good thing. I misunderstood the performance ramifications of COW and ZFS's opportunistic write locations, and came up with a much more pessimistic guess that it would approach random writes. As it is, I have upper (number of data spindles) and lower (number of disk sets) bounds to deal with. I suppose the available caching memory is what controls the resilience to the demands of random reads?

Thanks,
adam
Adam Lindsay
2007-Apr-18  21:30 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Bart Smaalders wrote:
> Adam Lindsay wrote:
>> Okay, the way you say it, it sounds like a good thing. I misunderstood
>> the performance ramifications of COW and ZFS's opportunistic write
>> locations, and came up with a much more pessimistic guess that it would
>> approach random writes. As it is, I have upper (number of data
>> spindles) and lower (number of disk sets) bounds to deal with. I
>> suppose the available caching memory is what controls the resilience
>> to the demands of random reads?
>
> W/ that many drives (16), if you hit in RAM the reads are not really
> random :-), or they span only a tiny fraction of the available disk
> space.

Clearly I hadn't thought that comment through. :) I think my mental model included imagined bottlenecks elsewhere in the system, but I haven't got to discussing those yet.

> Are you reading and writing the same file at the same time? Your cache
> hit rate will be much better then...

Not in the general case. Hmm, but there are some scenarios with multimedia caching boxes, so that could be interesting to leverage eventually.

bedankt (thanks),
adam
Bart Smaalders
2007-Apr-18  21:43 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Bart Smaalders wrote:
>> Adam Lindsay wrote:
>>> Okay, the way you say it, it sounds like a good thing. I
>>> misunderstood the performance ramifications of COW and ZFS's
>>> opportunistic write locations, and came up with a much more pessimistic
>>> guess that it would approach random writes. As it is, I have upper
>>> (number of data spindles) and lower (number of disk sets) bounds to
>>> deal with. I suppose the available caching memory is what controls
>>> the resilience to the demands of random reads?
>>
>> W/ that many drives (16), if you hit in RAM the reads are not really
>> random :-), or they span only a tiny fraction of the available disk
>> space.
>
> Clearly I hadn't thought that comment through. :) I think my mental
> model included imagined bottlenecks elsewhere in the system, but I
> haven't got to discussing those yet.

Hmmm... that _was_ prob. more opaque than necessary. What I meant was that you've got something on the order of 5 TB or better of disk space; assuming uniformly distributed reads of data and 4 GB of RAM, the odds of hitting in the cache are essentially zero wrt performance.

>> Are you reading and writing the same file at the same time? Your cache
>> hit rate will be much better then...
>
> Not in the general case. Hmm, but there are some scenarios with
> multimedia caching boxes, so that could be interesting to leverage
> eventually.
>
> bedankt,
> adam

graag gedaan (you're welcome).

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com      http://blogs.sun.com/barts
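The cache argument in this reply comes down to one ratio. As a rough sketch, assuming reads are uniformly distributed over the whole data set and, generously, that all 4 GB of RAM is available as cache (in practice less would be), the expected hit rate is just cache size divided by data size. The function name is made up for the example.

```python
def uniform_hit_rate(cache_bytes: float, dataset_bytes: float) -> float:
    """Expected cache hit rate for uniformly random reads over the data set."""
    return min(1.0, cache_bytes / dataset_bytes)


GB = 10**9
TB = 10**12

# Figures from the message: about 5 TB of disk space and 4 GB of RAM.
rate = uniform_hit_rate(cache_bytes=4 * GB, dataset_bytes=5 * TB)

print(f"expected hit rate: {rate:.2%}")
print(f"reads that still go to disk: {1 - rate:.2%}")
```

With well under one read in a thousand served from memory, random-read performance is set almost entirely by the spindles, which is why caching only helps when the working set is small or the same data is re-read soon after being written.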
Adam Lindsay
2007-Apr-18  22:15 UTC
[zfs-discuss] ZFS performance model for sustained, contiguous writes?
Thanks, Richard, for your comments.

Richard Elling wrote:
> so much data, so little time... :-)

:) indeed.

> Adam Lindsay wrote:
>> Clearly, there are elements of the model that don't apply to our
>> sustained read/writes, so does anyone have any guidance (theoretical
>> or empirical) on what we could expect in that arena?
>
> I have a model for the disk/media bandwidth. In this model, the bandwidth
> limit is the media speed of the disk, as determined by the disk vendor's
> data sheet. I then apply the RAID configuration to determine the range of
> maximum, sustainable, logical data, read, media bandwidth (whew! :-).
> For example, consider an X4500 with 6 Hitachi E7K500 (500 GByte) disks.
>
> config               min (MBytes/s)   max (MBytes/s)
> --------------------------------------------------
> RAIDZ2 (4d+2p)            124              259
> RAID1+0 (2d * 3)          186              389

And, extrapolating (and by implication from Bart's comments), this scales linearly as you add data (non-parity) spindles? Even the minimum figures are much more inviting than what I was guessing from the random read figures.

> This will give you a sense of the maximum media bandwidth capabilities,
> assuming you will blow by all caches. But this does not identify
> bottlenecks.

Indeed, and you quite nicely get into the other questions I had regarding my proposed server: what bottlenecks am I going to run into, practically? I suspect that's best left to another thread, with your indulgence.

> We know that channel, controller, memory, network, and CPU bottlenecks
> can and will impact actual performance, at least for large configs.
> Modeling these bottlenecks is possible, but will require more work in
> the tool. If you know the hardware topology, you can do a
> back-of-the-napkin analysis, too.
>
>> . Saturate a 1GigE link for sustained reads _and_ writes
>>   ... (long story... let's just imagine uncompressed HD video)
>
> This shouldn't be too hard, but you'll need a bunch of disks.

Okay, so I won't be shy about aspiring to saturate an aggregated 2x GigE link. :)

> In the above example, the Hitachi E7K500 is a 7,200 rpm, 3.5" disk.
> If you want to blaze, then the Seagate Savvio 2.5", 15k rpm
> disk should be able to do something like 60-95 MBytes/s sustained
> (I'm speculating; the last time I checked, they hadn't published
> the data sheet yet).
>
>> . Do it cheaply
>
> Fast disks aren't inexpensive :-(.

Indeed. That's why I want to scale out to lots of spindles, and was hoping ZFS would help make up for the other failings of using the cheap stuff.

>> My strong desires:
>> . ZFS for its reliability, redundancy, flexibility, and ease of use
>> . Maximise the amount of usable space
>> My resources:
>> . a server with 16x 500GB SATA drives usable for RAID
>
> That should work, at least as far as the disk media bandwidth requirements.
> You'll need to make sure that you have plenty of CPU power to drive the
> rest of the system.

That's the first decision I made when speccing out the system, really: the local vendor (experienced with running Linux on this chassis) originally proposed two fast single-core Opterons. I suggested two slower (2.2 GHz) dual-core ones.

Cheers,
adam