I'm trying to run some IOzone benchmarking on a new system to get a feel
for baseline performance. The system has a lot of memory (144GB), which
complicates things, but I have some time, so I'm approaching my runs as
follows:

Throughput:
    iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls

IOPS:
    iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Not sure what I gain/lose by using threads or not. Am I off on this?

System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of
15 disks each -- RAIDZ3. NexentaStor 3.1.2.

Ray
On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:

> Throughput:
>     iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
>
> IOPS:
>     iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Do you expect to be reading or writing 36GB or 288GB files very often on
this array? The largest file size I've used in my still lengthy benchmarks
was 16GB. If you use the sizes you've proposed, it could take several days
or weeks to complete. Try a web search for "iozone examples" if you want
more details on the command switches.

-Gary
On Mon, Apr 30, 2012 at 4:15 PM, Ray Van Dolson <rvandolson at esri.com> wrote:

> I'm trying to run some IOzone benchmarking on a new system to get a
> feel for baseline performance.

If you have compression turned on (and I highly recommend turning it on if
you have the CPU power to handle it), the IOzone data will be flawed. I did
not look deeper into it, but the data that IOzone writes compresses very,
very well -- much more so than any real data out there.

I used a combination of Filebench and Oracle's Orion to test ZFS
performance. Recently I started writing my own utilities for testing, as
_none_ of the existing offerings tested what I needed (lots and lots of
small, less than 64KB, files). My tool is only OK for relative measures.

> Unfortunately, the system has a lot of memory (144GB), but I have some
> time so am approaching my runs as follows:

When I was testing systems with more RAM than I wanted (when does that ever
happen :-), I capped the ARC to something rational (2GB, 4GB, etc.) and ran
the tests with file sizes four times the ARC limit. Unfortunately, the
siwiki site appears to be down (gone?). On Solaris 10, the following in
/etc/system (and a reboot) will cap the ZFS ARC to the amount of RAM
specified (in bytes). Not sure on Nexenta (and I have not had to cap the
ARC on my Nexenta Core system at home).

    set zfs:zfs_arc_max = 4294967296

> Throughput:
>     iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
>
> IOPS:
>     iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
>
> Not sure what I gain/lose by using threads or not.

IOzone without threads is single threaded and will demonstrate the
performance a single user or application will achieve. When you use threads
in IOzone you see performance for N simultaneous users (or applications).
In my experience, the knee in the performance vs. number-of-threads curve
happens somewhere between one and two times the number of CPUs in the
system. In other words, with a 16 CPU system, performance scales linearly
as the number of threads increases until you get to somewhere between 16
and 32. At that point the performance starts flattening out and eventually
_decreases_ as you add more threads. Using multiple threads (or processes
or clients, etc.) is a good way to measure how many simultaneous users your
system can handle (at a certain performance level).

> Am I off on this?
>
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of
> 15 disks each -- RAIDZ3. NexentaStor 3.1.2.

-- 
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
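A minimal sketch of verifying that such an ARC cap took effect after the
reboot, assuming the standard Solaris/illumos arcstats kstat names (values
are reported in bytes):

    # current ARC ceiling and actual ARC size
    kstat -p zfs:0:arcstats:c_max
    kstat -p zfs:0:arcstats:size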
On Mon, 30 Apr 2012, Ray Van Dolson wrote:

> I'm trying to run some IOzone benchmarking on a new system to get a
> feel for baseline performance.

Unfortunately, benchmarking with IOzone is a very poor indicator of what
performance will be like during normal use. Forcing the system to behave
like it is short on memory only tests how the system will behave when it is
short on memory.

Testing multi-threaded synchronous writes with IOzone might actually mean
something if it is representative of your work-load.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Tue, May 01, 2012 at 03:21:05AM -0700, Gary Driggs wrote:
> On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:
>
> > Throughput:
> >     iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
> >
> > IOPS:
> >     iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
>
> Do you expect to be reading or writing 36GB or 288GB files very often on
> this array? The largest file size I've used in my still lengthy
> benchmarks was 16GB. If you use the sizes you've proposed, it could
> take several days or weeks to complete. Try a web search for "iozone
> examples" if you want more details on the command switches.
>
> -Gary

The problem is this box has 144GB of memory. If I go with a 16GB file size
(which I did), then memory and caching influence the results pretty
severely (I get around 3GB/sec for writes!).

Obviously, I could yank RAM for purposes of benchmarking... :)

Thanks,
Ray
On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
> On Mon, 30 Apr 2012, Ray Van Dolson wrote:
>
> > I'm trying to run some IOzone benchmarking on a new system to get a
> > feel for baseline performance.
>
> Unfortunately, benchmarking with IOzone is a very poor indicator of
> what performance will be like during normal use. Forcing the system
> to behave like it is short on memory only tests how the system will
> behave when it is short on memory.
>
> Testing multi-threaded synchronous writes with IOzone might actually
> mean something if it is representative of your work-load.
>
> Bob

Sounds like IOzone may not be my best option here (though it does produce
pretty graphs). bonnie++ actually gave me more realistic-sounding numbers,
and I've been reading good things about fio.

Ray
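For reference, a minimal fio invocation roughly matching the workload
discussed above (eight concurrent 128k jobs); the directory, size, and
runtime here are placeholder values and would need adjusting for the pool
under test:

    # 8 jobs doing mixed 128k random reads and writes for 10 minutes
    fio --name=zfs-randrw --directory=/volumes/testpool/fio \
        --rw=randrw --bs=128k --size=16g --numjobs=8 \
        --ioengine=psync --runtime=600 --time_based --group_reporting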
On 5/1/12, Ray Van Dolson wrote:
> The problem is this box has 144GB of memory. If I go with a 16GB file
> size (which I did), then memory and caching influence the results
> pretty severely (I get around 3GB/sec for writes!).

The idea of benchmarking -- IMHO -- is to vaguely attempt to reproduce real
world loads. Obviously, this is an imperfect science, but if you're going
to be writing a lot of small files (e.g. NNTP or email servers used to be a
good real world example) then you're going to want to benchmark for that.
If you're going to write a bunch of huge files (are you writing a lot of
16GB files?) then you'll want to test for that. Caching anywhere in the
pipeline is important for benchmarks because you aren't going to turn off a
cache or remove RAM in production, are you?

-Gary
On Tue, May 1, 2012 at 1:45 PM, Gary <gdriggs at gmail.com> wrote:
> The idea of benchmarking -- IMHO -- is to vaguely attempt to reproduce
> real world loads. Obviously, this is an imperfect science, but if
> you're going to be writing a lot of small files (e.g. NNTP or email
> servers used to be a good real world example) then you're going to
> want to benchmark for that. If you're going to write a bunch of huge
> files (are you writing a lot of 16GB files?) then you'll want to test
> for that. Caching anywhere in the pipeline is important for benchmarks
> because you aren't going to turn off a cache or remove RAM in
> production, are you?

It also depends on what you are going to be tuning. When I needed to decide
on a zpool configuration (number of vdevs, type of vdev, etc.) I did not
want the effect of the cache "hiding" the underlying performance
limitations of the physical drive configuration. In that case I either
needed to use a very large test data set or reduce the size (effect) of the
RAM. By limiting the ARC to 2GB for my test, I was able to relatively
easily quantify the performance differences between the various
configurations.

Once we picked a configuration, we let the ARC take as much RAM as it
wanted and re-ran the benchmark to see what kind of real world performance
we would get. Unfortunately, we could not easily simulate 400 real world
people sitting at desktops accessing the data. So our ARC-limited benchmark
was effectively a "worst case" number and the full ARC the "best case". The
real world, as usual, fell somewhere in between.

Finding a benchmark tool that matches _my_ work load is why I have started
kludging together my own.

-- 
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
On Tue, 1 May 2012, Ray Van Dolson wrote:

>> Testing multi-threaded synchronous writes with IOzone might actually
>> mean something if it is representative of your work-load.
>
> Sounds like IOzone may not be my best option here (though it does
> produce pretty graphs).
>
> bonnie++ actually gave me more realistic-sounding numbers, and I've
> been reading good things about fio.

None of these benchmarks is really useful other than to stress-test your
hardware. Assuming that the hardware is working properly, when you
intentionally break the cache, IOzone should produce numbers similar to
what you could have estimated from hardware specification sheets and an
understanding of the algorithms.

Sun engineers used 'filebench' to do most of their performance testing
because it allowed configuring the behavior to emulate various usage
models. You can get it from "https://sourceforge.net/projects/filebench/".

Zfs is all about caching, so the cache really does need to be included (and
not intentionally broken) in any realistic measurement of how the system
will behave.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
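A minimal filebench session along the lines Bob describes, using one of the
bundled workload personalities; exact syntax can vary between filebench
releases, and $dir here is a placeholder for the dataset under test:

    # interactive run of the canned "fileserver" workload for 60 seconds
    filebench
    filebench> load fileserver
    filebench> set $dir=/volumes/testpool/fb
    filebench> run 60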
more comments...

On May 1, 2012, at 10:41 AM, Ray Van Dolson wrote:
> On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
>> On Mon, 30 Apr 2012, Ray Van Dolson wrote:
>>
>>> I'm trying to run some IOzone benchmarking on a new system to get a
>>> feel for baseline performance.
>>
>> Unfortunately, benchmarking with IOzone is a very poor indicator of
>> what performance will be like during normal use. Forcing the system
>> to behave like it is short on memory only tests how the system will
>> behave when it is short on memory.
>>
>> Testing multi-threaded synchronous writes with IOzone might actually
>> mean something if it is representative of your work-load.
>>
>> Bob
>
> Sounds like IOzone may not be my best option here (though it does
> produce pretty graphs).

For performance analysis of ZFS systems, you need to consider the
advantages of the hybrid storage pool. I wrote a white paper last summer
describing a model that you can use with your performance measurements or
data from vendor datasheets:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf
And in presentation form:
http://www.slideshare.net/relling/nexentastor-performance-tuning-openstorage-summit-2011
Recently, this model has been expanded and enhanced. Contact me offline if
you are interested.

I have used IOzone, filebench, and vdbench for a lot of performance
characterization lately. Each has its own strengths, but all can build a
full characterization profile of a system. For IOzone, I like to run a full
characterization run, which precludes multithreaded runs, for a spectrum of
I/O sizes and WSS. Such info can be useful to explore the boundaries of
your system's performance and compare to other systems. Also, for systems
with > 50GB of RAM, there are some tunables needed for good scaling under
heavy write workloads. Alas, there is no perfect answer and no single
tunable setting works optimally for all cases. WIP. YMMV. A single, summary
metric is not very useful...

> bonnie++ actually gave me more realistic-sounding numbers, and I've
> been reading good things about fio.

IMNSHO, bonnie++ is a totally useless benchmark. Roch dissected it rather
nicely at https://bigip-blogs-cms-adc.oracle.com/roch/entry/decoding_bonnie
[gag me! Does Oracle have butt-ugly URLs or what? ;-)]

 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
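A sketch of the kind of full-characterization IOzone run Richard mentions,
using automatic mode; the path and size limits are placeholders, and file
sizes far below RAM will largely measure the ARC rather than the disks:

    # sweep record sizes and file sizes up to 4GB for write, read, and
    # random-mix tests, and dump an Excel-style report
    iozone -a -n 64k -g 4g -i 0 -i 1 -i 2 \
        -f /volumes/testpool/iozone.tmp -R -b full_run.xls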
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Ray Van Dolson
>
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of
> 15 disks each -- RAIDZ3. NexentaStor 3.1.2.

I think you'll get better performance and reliability if you break each of
those 15-disk raidz3's into three 5-disk raidz1's. Here's why:

Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
operation, and on the 4th failure, you're toast. Obviously, with raidz1, if
any 1 of 5 disks fails, you're still in operation, and on the 2nd failure,
you're toast. So it's all about computing the probability of 4 overlapping
failures in the 15-disk raidz3 versus 2 overlapping failures in a smaller
5-disk raidz1. In order to calculate that, you need to estimate the time to
resilver any one failed disk...

In ZFS, suppose you have a record of 128k, and suppose you have a 2-way
mirror vdev. Then each disk writes 128k. If you have a 3-disk raidz1, then
each disk writes 64k. If you have a 5-disk raidz1, then each disk writes
32k. If you have a 15-disk raidz3, then each disk writes 10.6k.

Assume you have a machine in production, you are doing autosnapshots, and
your data is volatile. Over time, that serves to fragment your data, and
after a year or two of being in production, your resilver will be composed
almost entirely of random IO. Each of the non-failed disks must read its
segment of the stripe in order to reconstruct the data that will be written
to the new good disk. In the 15-disk raidz3 configuration, your segment
size is approx 3x smaller than in the 5-disk raidz1, which means approx 3x
more IO operations.

Another way of saying that... Assume the amount of data you will write to
your pool is the same regardless of which architecture you choose. For
discussion purposes, let's say you write 3T to your pool, and let's
momentarily assume your whole pool will be composed of 15 disks, either in
a single raidz3 or in 3x 5-disk raidz1. If you use one big raidz3, the 3T
will require at least 24 million 128k records to hold it all, and each 128k
record will be divided up onto all the disks. If you use the smaller
raidz1's, only 1T gets written to each vdev, and you will only need 8
million records on each disk. Thus, to resilver the large vdev, you will
require 3x more IO operations.

Worse still, on each IO request you have to wait for the slowest of all
disks to return. If you were in a 2-way mirror situation, your seek time
would be the average seek time of a single disk. But if you were in an
infinite-disk situation, your seek time would be the worst-case seek time
on every single IO operation, which is about 2x longer than the average
seek time. So not only do you have 3x more seeks to perform, you have up to
2x longer to wait upon each seek...

Now, to put some numbers on this... A single 1T disk can sustain (let's
assume) 1.0 Gbit/sec read/write sequential. This means resilvering the
entire disk sequentially, including unused space (which is not what ZFS
does), would require 2.2 hours. In practice, on my 1T disks, which are in a
mirrored configuration, I find resilvering takes 12 hours. I would expect
this to be ~4 days if I were using 5-disk raidz1, and ~12 days if I were
using 15-disk raidz3.

Your disks are all 2T, so you should double all the times I just wrote.
Your raidz3 should be able to resilver a single disk in approx 24 days.
Your 5-disk raidz1 should be able to do one in ~8 days. If you were using
mirrors, ~1 day.

Suddenly the prospect of multiple failures overlapping doesn't seem so
unlikely.
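A quick back-of-envelope check of the record counts above, assuming binary
terabytes and 128k records (the exact figures depend on how you count a
"T", but the 3:1 ratio is what matters):

    # records needed for 3T across one 15-disk raidz3
    echo '3 * 2^40 / (128 * 1024)' | bc    # ~25 million
    # records needed for the 1T on each 5-disk raidz1
    echo '1 * 2^40 / (128 * 1024)' | bc    # ~8.4 million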
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Paul Kraus
>
> If you have compression turned on (and I highly recommend turning
> it on if you have the CPU power to handle it),

What if he's storing video files, compressed files, or encrypted data? Then
compression is 100% waste. So you should qualify a statement like that...
Compression can be great, depending on the type of data to be stored. In my
usage scenarios, I usually benefit a lot, both in terms of capacity and
speed, by enabling compression.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> Zfs is all about caching, so the cache really does need to be included
> (and not intentionally broken) in any realistic measurement of how the
> system will behave.

I agree with what others have said -- and this comment in particular. The
only useful thing you can do is to NOT break your system intentionally, and
instead find ways to emulate the real-life jobs you want to do.

This is exceptionally difficult, because in real life your system will be
on for a long time, doing periodic snapshot rotation and periodic scrubs,
and people will be doing all sorts of work scattered about on disk...
sometimes writing, sometimes reading, sometimes modifying, sometimes
deleting. The modifies and deletes are particularly important, because when
you mix a bunch of reads/writes/overwrites/deletes with a bunch of
snapshots automatically being created and destroyed over time, these
behaviors totally change the way data gets distributed throughout your
pool. And the periodic scrub will also affect your memory usage and
therefore distribution patterns.

Given the amount of RAM you have, I really don't think you'll be able to
get any useful metric out of iozone in this lifetime.
On Thu, May 3, 2012 at 10:39 AM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Paul Kraus
>>
>> If you have compression turned on (and I highly recommend turning
>> it on if you have the CPU power to handle it),
>
> What if he's storing video files, compressed files, or encrypted data?
> Then compression is 100% waste. So you should qualify a statement like
> that... Compression can be great, depending on the type of data to be
> stored. In my usage scenarios, I usually benefit a lot, both in terms
> of capacity and speed, by enabling compression.

Even with incompressible data I measure better performance with compression
turned on rather than off. I have been testing with random data that shows
a compressratio of 1:1. I will test with some real data that is already
highly compressed and see if that agrees with my prior testing.

-- 
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
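A minimal sketch of the kind of A/B comparison being described, assuming a
test pool whose name is a placeholder: create one compressed and one
uncompressed dataset, run the same benchmark against each, and check the
achieved ratio afterwards:

    # two datasets differing only in the compression property
    zfs create -o compression=on testpool/comp
    zfs create -o compression=off testpool/nocomp
    # after writing the benchmark data, see what compression achieved
    zfs get compression,compressratio testpool/comp testpool/nocomp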
On Thu, May 3, 2012 at 7:47 AM, Edward Ned Harvey wrote:
> Given the amount of RAM you have, I really don't think you'll be able to
> get any useful metric out of iozone in this lifetime.

I still think it would be apropos if dedup and compression were being used.
In that case, does filebench have an option for testing either of those?

-Gary
On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Ray Van Dolson
> >
> > System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs
> > of 15 disks each -- RAIDZ3. NexentaStor 3.1.2.
>
> I think you'll get better performance and reliability if you break each
> of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:
>
> [...resilver analysis snipped...]
>
> Your disks are all 2T, so you should double all the times I just wrote.
> Your raidz3 should be able to resilver a single disk in approx 24 days.
> Your 5-disk raidz1 should be able to do one in ~8 days. If you were using
> mirrors, ~1 day.
>
> Suddenly the prospect of multiple failures overlapping doesn't seem so
> unlikely.

Ed, thanks for taking the time to write this all out. Definitely food for
thought.

Ray
On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>
> I think you'll get better performance and reliability if you break each
> of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:

Incorrect on reliability; see below.

> Now, to put some numbers on this... A single 1T disk can sustain (let's
> assume) 1.0 Gbit/sec read/write sequential. This means resilvering the
> entire disk sequentially, including unused space (which is not what ZFS
> does), would require 2.2 hours. In practice, on my 1T disks, which are
> in a mirrored configuration, I find resilvering takes 12 hours. I would
> expect this to be ~4 days if I were using 5-disk raidz1, and ~12 days
> if I were using 15-disk raidz3.

Based on your use of "I would expect", I'm guessing you haven't done the
actual measurement. I see ~12-16 hour resilver times on pools using 1TB
drives in raidz configurations. The resilver times don't seem to vary with
whether I'm using raidz1 or raidz2.

> Suddenly the prospect of multiple failures overlapping doesn't seem so
> unlikely.

Which is *exactly* why you need multiple-parity solutions. Put simply, if
you're using single-parity redundancy with 1TB drives or larger (raidz1 or
2-way mirroring) then you're putting your data at risk. I'm seeing -- at a
very low level, but clearly non-zero -- occasional read errors during
rebuild of raidz1 vdevs, leading to data loss. Usually it's just one file,
so it's not too bad (and zfs will tell you which file has been lost). And
the observed error rates we're seeing in terms of uncorrectable (and
undetectable) errors from drives are actually slightly better than you
would expect from the manufacturers' spec sheets.

So you definitely need raidz2 rather than raidz1; I'm looking at going to
raidz3 for solutions using current high-capacity (i.e. 3TB) drives.

(On performance, I know what the theory says about getting one disk's worth
of IOPS out of each vdev in a raidz configuration. In practice we're
finding that our raidz systems actually perform pretty well when compared
with dynamic stripes, mirrors, and hardware raid LUNs.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
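For anyone wanting to gather their own resilver numbers, a minimal way to
watch one in progress (the pool name here is a placeholder):

    # shows percent done and an estimated time to go while resilvering
    zpool status -v testpool
    # per-vdev I/O rates, sampled every 5 seconds, while the resilver runs
    zpool iostat -v testpool 5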
On 5/4/2012 1:24 PM, Peter Tribble wrote:
> On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey
> <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>> I think you'll get better performance and reliability if you break each
>> of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:
> Incorrect on reliability; see below.
>
> I see ~12-16 hour resilver times on pools using 1TB drives in raidz
> configurations. The resilver times don't seem to vary with whether I'm
> using raidz1 or raidz2.
>
> So you definitely need raidz2 rather than raidz1; I'm looking at going
> to raidz3 for solutions using current high-capacity (i.e. 3TB) drives.
>
> (On performance, I know what the theory says about getting one disk's
> worth of IOPS out of each vdev in a raidz configuration. In practice
> we're finding that our raidz systems actually perform pretty well when
> compared with dynamic stripes, mirrors, and hardware raid LUNs.)

Really, guys: Richard, myself, and several others have covered how ZFS does
resilvering (and disk reliability, a related issue), and included very
detailed calculations on IOPS required and discussions about slabs,
recordsize, and how disks operate with regards to seek/access times and OS
caching. Please search the archives, as it's not fruitful to repost the
exact same thing repeatedly.

Short version: assuming identical drives and the exact same usage pattern
and /amount/ of data, the time it takes the various ZFS configurations to
resilver is N for ANY mirrored config and a bit less than N*M for an M-disk
RAIDZ*, where M = the number of data disks in the RAIDZ* -- thus a 6-drive
(total) RAIDZ2 will have the same resilver time as a 5-drive (total)
RAIDZ1. Calculating what N is depends entirely on the pattern in which the
data was written on the drive. You're always going to be IOPS-bound on the
disk being resilvered.

Which RAIDZ* config to use (assuming you have a fixed tolerance for data
loss) depends entirely on what your data usage pattern does to resilver
times; configurations needing very long resilver times had better have more
redundancy. And remember, larger configs allow more data to be stored,
which also increases resilver time.

Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's worth
of IOPS (averaged over a reasonable time period). Caching may make it
appear to give more IOPS in certain cases, but that's neither sustainable
nor predictable, and the backing store is still only giving 1 disk's IOPS.
The RAIDZ* may, however, give you significantly more throughput (in MB/s)
than a single disk if you do a lot of sequential read or write.

-Erik
On Fri, 4 May 2012, Erik Trimble wrote:
> predictable, and the backing store is still only giving 1 disk's IOPS.
> The RAIDZ* may, however, give you significantly more throughput (in
> MB/s) than a single disk if you do a lot of sequential read or write.

Has someone done real-world measurements which indicate that raidz*
actually provides better sequential read or write than simple mirroring
with the same number of disks? While it seems that there should be an
advantage, I don't recall seeing posted evidence of such. If there was a
measurable advantage, it would be under conditions which are unlikely in
the real world.

The only thing totally clear to me is that raidz* provides better storage
efficiency than mirroring and that raidz1 is dangerous with large disks.

Provided that the media reliability is sufficiently high, there are still
many performance and operational advantages obtained from simple mirroring
(duplex mirroring) with zfs.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 5/5/2012 8:04 AM, Bob Friesenhahn wrote:
> On Fri, 4 May 2012, Erik Trimble wrote:
>> predictable, and the backing store is still only giving 1 disk's
>> IOPS. The RAIDZ* may, however, give you significantly more
>> throughput (in MB/s) than a single disk if you do a lot of sequential
>> read or write.
>
> Has someone done real-world measurements which indicate that raidz*
> actually provides better sequential read or write than simple
> mirroring with the same number of disks? While it seems that there
> should be an advantage, I don't recall seeing posted evidence of such.
> If there was a measurable advantage, it would be under conditions
> which are unlikely in the real world.
>
> The only thing totally clear to me is that raidz* provides better
> storage efficiency than mirroring and that raidz1 is dangerous with
> large disks.
>
> Provided that the media reliability is sufficiently high, there are
> still many performance and operational advantages obtained from simple
> mirroring (duplex mirroring) with zfs.
>
> Bob

I'll see what I can do about actual measurements. Given that we're really
recommending a minimum of RAIDZ2 nowadays (with disks > 1TB), that means,
for N disks, you get N-2 data disks in a RAIDZ2 and N/2 data disks in a
standard striped mirror. My brain says that even with the overhead of
parity calculation, for sequential read/write of at least the slab size
(i.e. involving all the data drives in a RAIDZ2), performance for the
RAIDZ2 should be better for N >= 6. But that's my theoretical brain, and we
should do some decent benchmarking to put some hard facts to that.

-Erik
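A sketch of the head-to-head Erik proposes, for six disks; the pool and
device names are placeholders, and the same benchmark would be run against
each layout in turn:

    # layout 1: one 6-disk RAIDZ2 (4 data disks)
    zpool create perftest raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
    # ... run the benchmark, record results, then tear down ...
    zpool destroy perftest
    # layout 2: three striped 2-way mirrors (3 data disks)
    zpool create perftest mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
        mirror c1t4d0 c1t5d0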
On May 5, 2012, at 8:04 AM, Bob Friesenhahn wrote:
> On Fri, 4 May 2012, Erik Trimble wrote:
>> predictable, and the backing store is still only giving 1 disk's IOPS.
>> The RAIDZ* may, however, give you significantly more throughput (in
>> MB/s) than a single disk if you do a lot of sequential read or write.
>
> Has someone done real-world measurements which indicate that raidz*
> actually provides better sequential read or write than simple mirroring
> with the same number of disks? While it seems that there should be an
> advantage, I don't recall seeing posted evidence of such. If there was
> a measurable advantage, it would be under conditions which are unlikely
> in the real world.

Why would one expect raidz to be faster? Mirrors will always win on reads
because you read from all sides of the mirror. Writes are a bit more
difficult to predict and measure, mostly because ZFS writes to the pool are
async.

> The only thing totally clear to me is that raidz* provides better
> storage efficiency than mirroring and that raidz1 is dangerous with
> large disks.

Space, performance, dependability: pick two.

 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> Has someone done real-world measurements which indicate that raidz*
> actually provides better sequential read or write than simple
> mirroring with the same number of disks? While it seems that there
> should be an advantage, I don't recall seeing posted evidence of such.
> If there was a measurable advantage, it would be under conditions
> which are unlikely in the real world.

Apparently I pulled it down at some point, so I don't have a URL for you
anymore, but I did, and I posted. Long story short, both raidzN and mirror
configurations behave approximately the way you would hope they do.

Approximately, as compared to a single disk -- and I *mean* approximately,
because I'm just pulling it back from memory the way I chose to remember
it, which is to say, a simplified model that I felt comfortable with:

                   seq rd   seq wr   rand rd   rand wr
  2-disk mirror      2x       1x       2x        1x
  3-disk mirror      3x       1x       3x        1x
  2x 2-disk mirr     4x       2x       4x        2x
  3x 2-disk mirr     6x       3x       6x        3x
  3-disk raidz       2x       2x       1x        1x
  4-disk raidz       3x       3x       1x        1x
  5-disk raidz       4x       4x       1x        1x
  6-disk raidz       5x       5x       1x        1x

I went on to test larger and more complex arrangements... started getting
things like 1.9x and 1.8x where I would have expected 2x, and so forth...
Sorry for being vague now, but the data isn't in front of me anymore. Might
not ever be again.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Paul Kraus
>
> Even with incompressible data I measure better performance with
> compression turned on rather than off.

*cough*
On Mon, 7 May 2012, Edward Ned Harvey wrote:
>
> Apparently I pulled it down at some point, so I don't have a URL for you
> anymore, but I did, and I posted. Long story short, both raidzN and
> mirror configurations behave approximately the way you would hope they
> do.
>
> Approximately, as compared to a single disk -- and I *mean* approximately,

Yes, I remember your results. In a few weeks I should be setting up a new
system with OpenIndiana and 8 SAS disks. This will give me an opportunity
to test again. Last time I got to play was back in February 2008 and I did
not bother to test raidz
(http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf).

Most common benchmarking is sequential read/write and rarely
read-file/write-file, where 'file' is a megabyte or two and the file is
different for each iteration.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On May 7, 2012, at 1:53 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> Has someone done real-world measurements which indicate that raidz*
>> actually provides better sequential read or write than simple
>> mirroring with the same number of disks? While it seems that there
>> should be an advantage, I don't recall seeing posted evidence of such.
>> If there was a measurable advantage, it would be under conditions
>> which are unlikely in the real world.
>
> Apparently I pulled it down at some point, so I don't have a URL for you
> anymore, but I did, and I posted. Long story short, both raidzN and
> mirror configurations behave approximately the way you would hope they
> do.
>
> Approximately, as compared to a single disk -- and I *mean*
> approximately, because I'm just pulling it back from memory the way I
> chose to remember it, which is to say, a simplified model that I felt
> comfortable with:

This model is completely wrong for writes. Suggest you deal with writes
separately. Also, the random reads must be small random reads, where I/O
size << 128k. For most common use cases, expect random reads to be 4k or
8k.
 -- richard

>                    seq rd   seq wr   rand rd   rand wr
>   2-disk mirror      2x       1x       2x        1x
>   3-disk mirror      3x       1x       3x        1x
>   2x 2-disk mirr     4x       2x       4x        2x
>   3x 2-disk mirr     6x       3x       6x        3x
>   3-disk raidz       2x       2x       1x        1x
>   4-disk raidz       3x       3x       1x        1x
>   5-disk raidz       4x       4x       1x        1x
>   6-disk raidz       5x       5x       1x        1x
>
> I went on to test larger and more complex arrangements... started
> getting things like 1.9x and 1.8x where I would have expected 2x, and
> so forth... Sorry for being vague now, but the data isn't in front of
> me anymore. Might not ever be again.

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422