Hello zfs-discuss,

Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks

filebench/randomread script, filesize=256GB

2 disks for system, 2 disks as hot spares, atime set to off for the
pool, cache_bshift set to 8K (2^13), recordsize untouched (default).

"pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)" means that the pool
was created with 4 raid-z1 groups of 5 disks each and another 4
raid-z1 groups of 6 disks each.

1. pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)
   (36 disks of usable space)

   a. nthreads = 1     ~60 ops
   b. nthreads = 4    ~250 ops
   c. nthreads = 8    ~520 ops
   d. nthreads = 128 ~1340 ops

   1340/8 = 167 ops

2. pool: 2x raid-z2 (10 disks) + 2x raid-z2 (12 disks)
   (36 disks of usable space)

   a. nthreads = 1     ~50 ops
   b. nthreads = 4    ~190 ops
   c. nthreads = 8    ~360 ops
   d. nthreads = 128  ~720 ops

   720/4 = 180 ops

3. pool: 2x raid-z2 (22 disks)
   (40 disks of usable space)

   a. nthreads = 1     ~40 ops
   b. nthreads = 4    ~120 ops
   c. nthreads = 8    ~160 ops
   d. nthreads = 128  ~345 ops

   345/2 = 172 ops

4. pool: 4x raid-z (11 disks)
   (40 disks of usable space)

   a. nthreads = 1     ~50 ops
   b. nthreads = 4    ~190 ops
   c. nthreads = 8    ~340 ops
   d. nthreads = 128  ~710 ops

   710/4 = 177 ops

5. pool: 4x raid-z2 (11 disks)
   (36 disks of usable space)

   a. nthreads = 1     ~55 ops
   b. nthreads = 4    ~200 ops
   c. nthreads = 8    ~350 ops
   d. nthreads = 128  ~760 ops

   760/4 = 190 ops

6. pool: 22x mirror (2 disks)
   (22 disks of usable space)

   a. nthreads = 1     ~75 ops
   b. nthreads = 4    ~320 ops
   c. nthreads = 8    ~670 ops
   d. nthreads = 128 ~3900 ops

   3900/22 = 177 ops
   3900/44 =  88 ops (it's a read test after all)

Well, random reads really tend to give only about 1-2x the IOPS of a
single disk per raid-z[12] group. For some workloads that is really bad.
For some workloads I would definitely prefer much better random read
performance in terms of IO/s and would trade write performance for it.
Maybe something like a raid-y[12] that behaves more like classical
raid-[56]? That way the user would have a choice: good random read
performance, or excellent write performance, instead of only the latter.

Right now RAID-Z1 and RAID-Z2 are just plain horrible in terms of IOPS
in random-read environments where the cache hit ratio is marginal.
This is especially painful on an x4500.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
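For reference, the first pool layout above could be created along these
lines; the disk names below are placeholders for illustration, not the
actual x4500 device names:

  # 4 raid-z1 groups of 5 disks each (device names are hypothetical)
  zpool create tank \
      raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
      raidz c0t5d0 c0t6d0 c0t7d0 c1t0d0 c1t1d0 \
      raidz c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
      raidz c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0

  # plus 4 raid-z1 groups of 6 disks each, added to the same pool
  zpool add tank \
      raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 \
      raidz c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 \
      raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 \
      raidz c4t6d0 c4t7d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0

  zfs set atime=off tank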
Wow. Thanks for the data. This is somewhat consistent with what I
predict in RAIDoptimizer.

Robert Milkowski wrote:
> Hello zfs-discuss,
>
> Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks
>
> filebench/randomread script, filesize=256GB

Your performance numbers are better than I predict, and I believe it is
because of filesize=256GB. With a filesize that small relative to the
pool, you won't be getting any long seeks. In other words, your
performance will go down as you use more of the space and need long
seeks. I've added my RAIDoptimizer predictions below.

RAIDoptimizer predicts that a single Hitachi E7K500 SATA drive is good
for 79 short, random read iops (7,200 rpm, 8.5 ms avg seek).

These drives should be using the SATA framework, which includes NCQ, so
that will also have a positive effect on performance for many threads.

> 2 disks for system, 2 disks as hot spares, atime set to off for the
> pool, cache_bshift set to 8K (2^13), recordsize untouched (default).
>
> pool: 4x raid-z (5 disks) + 4x raid-z (6 disks) means that the pool
> was created with 4 raid-z1 groups of 5 disks each and another 4
> raid-z1 groups of 6 disks each.
>
> 1. pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)
>    (36 disks of usable space)
>
>    a. nthreads = 1     ~60 ops
>    b. nthreads = 4    ~250 ops
>    c. nthreads = 8    ~520 ops
>    d. nthreads = 128 ~1340 ops
>
>    1340/8 = 167 ops

RAIDoptimizer predicts 632 iops for lots of threads.
measured/predicted = 2.12

> 2. pool: 2x raid-z2 (10 disks) + 2x raid-z2 (12 disks)
>    (36 disks of usable space)
>
>    a. nthreads = 1     ~50 ops
>    b. nthreads = 4    ~190 ops
>    c. nthreads = 8    ~360 ops
>    d. nthreads = 128  ~720 ops
>
>    720/4 = 180 ops

RAIDoptimizer predicts 316 iops for lots of threads.
measured/predicted = 2.28

> 3. pool: 2x raid-z2 (22 disks)
>    (40 disks of usable space)
>
>    a. nthreads = 1     ~40 ops
>    b. nthreads = 4    ~120 ops
>    c. nthreads = 8    ~160 ops
>    d. nthreads = 128  ~345 ops
>
>    345/2 = 172 ops

RAIDoptimizer predicts 158 iops for lots of threads.
measured/predicted = 2.18

> 4. pool: 4x raid-z (11 disks)
>    (40 disks of usable space)
>
>    a. nthreads = 1     ~50 ops
>    b. nthreads = 4    ~190 ops
>    c. nthreads = 8    ~340 ops
>    d. nthreads = 128  ~710 ops
>
>    710/4 = 177 ops

RAIDoptimizer predicts 316 iops for lots of threads.
measured/predicted = 2.25

> 5. pool: 4x raid-z2 (11 disks)
>    (36 disks of usable space)
>
>    a. nthreads = 1     ~55 ops
>    b. nthreads = 4    ~200 ops
>    c. nthreads = 8    ~350 ops
>    d. nthreads = 128  ~760 ops
>
>    760/4 = 190 ops

RAIDoptimizer predicts 316 iops for lots of threads.
measured/predicted = 2.41

> 6. pool: 22x mirror (2 disks)
>    (22 disks of usable space)
>
>    a. nthreads = 1     ~75 ops
>    b. nthreads = 4    ~320 ops
>    c. nthreads = 8    ~670 ops
>    d. nthreads = 128 ~3900 ops
>
>    3900/22 = 177 ops
>    3900/44 =  88 ops (it's a read test after all)

RAIDoptimizer predicts 3474 iops for lots of threads.
measured/predicted = 1.12

> Well, random reads really tend to give only about 1-2x the IOPS of a
> single disk per raid-z[12] group. For some workloads that is really bad.
> For some workloads I would definitely prefer much better random read
> performance in terms of IO/s and would trade write performance for it.
> Maybe something like a raid-y[12] that behaves more like classical
> raid-[56]? That way the user would have a choice: good random read
> performance, or excellent write performance, instead of only the latter.

I don't see how you can get both end-to-end data integrity and read
avoidance. Given the large number of horror stories from RAID-5 users,
I'd rather not give in on end-to-end data integrity.
Ideas?

> Right now RAID-Z1 and RAID-Z2 are just plain horrible in terms of IOPS
> in random-read environments where the cache hit ratio is marginal.
> This is especially painful on an x4500.

Yes. I might need to adjust RAIDoptimizer's predictions for RAID-Z/Z2.
I'll need more data to prove a new model, though [yes, it's on my list :-)]
 -- richard
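The 79 iops figure above presumably comes from the usual first-order
model for small random reads: one average seek plus half a rotation per
I/O.

  service time ~= avg seek + 1/2 rotation
               =  8.5 ms + 0.5 * (60,000 ms / 7,200 rpm)
               =  8.5 ms + 4.17 ms
               ~= 12.7 ms per I/O

  1 / 12.7 ms  ~= 79 random read iops per drive

The per-pool predictions then work out to roughly one disk's worth of
iops per raid-z group (8 groups x 79 = 632, 4 x 79 = 316, 2 x 79 = 158),
and roughly one disk's worth per mirror side for the mirrored pool
(44 x 79 = 3476, close to the 3474 quoted).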
> I don't see how you can get both end-to-end data integrity and
> read avoidance.

Checksum the individual RAID-5 blocks, rather than the entire stripe?

In more detail: allow the pointer to the block to contain one checksum
per device used (the count will vary if you're using a RAID-Z style
algorithm). Checksum each device's data independently, so the pointer
looks like one of:

  (a) array of <device, offset, checksum> tuples
  (b) <device, offset> tuple and array of checksums

The latter is closer to what we have today with RAID-Z (allocate across
all devices); the former is more flexible and might work better if the
number of disks in the stripe can be changed.

Reading an entire stripe then requires reading all the data (as today)
and verifying the individual checksums. If any checksum fails,
reconstruct from the remaining blocks.

Reading part of a stripe requires reading only the data from the disks
holding the requested data and verifying their individual checksums. If
any checksum fails, fall back on reading the whole block (across all
devices) and reconstructing.

Writing an entire stripe is pretty much as today: write all the data to
the requested disks, but with individual checksums.

Writing a partial stripe is more interesting. With style (b) block
pointers -- RAID-Z style -- you need to do a read/modify/write of the
stripe to get all the new data into the right place. But with style (a)
block pointers, you're back to RAID-5 style writes (read old data+parity
or remaining data, write new data+parity). (You don't need to rewrite
the whole stripe since the block pointer can refer to the original
partial stripe.)
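To make the two layouts concrete, here is a rough sketch in C of what
such block pointers could look like. All type and field names are
invented for illustration; this is not the actual ZFS blkptr_t, and real
block pointers also carry sizes, birth times, and other fields omitted
here.

    #include <stdint.h>

    #define MAX_COLS 16                 /* illustrative bound on stripe width */

    typedef struct cksum256 {
            uint64_t word[4];           /* one 256-bit per-column checksum */
    } cksum256_t;

    /*
     * Style (a): one <device, offset, checksum> tuple per column.
     * Each column can live anywhere on any device, so stripe width
     * and placement can vary from block to block.
     */
    typedef struct bp_style_a {
            uint8_t  ncols;             /* number of columns actually used */
            struct {
                    uint32_t   device;  /* vdev id */
                    uint64_t   offset;  /* byte offset on that vdev */
                    cksum256_t cksum;   /* checksum of this column only */
            } col[MAX_COLS];
    } bp_style_a_t;

    /*
     * Style (b): a single <device, offset> (RAID-Z style allocation
     * across all devices) plus an array of per-column checksums.
     */
    typedef struct bp_style_b {
            uint32_t   device;          /* leading vdev id */
            uint64_t   offset;          /* starting offset of the stripe */
            uint8_t    ncols;
            cksum256_t cksum[MAX_COLS]; /* per-column checksums */
    } bp_style_b_t;

A partial-stripe read under either style would verify only the cksum
entries for the columns it actually touched, falling back to reading the
whole stripe and reconstructing if one of them fails, exactly as
described above.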
> Checksum the individual RAID-5 blocks, rather than the entire stripe?

Depending on the number of drives in your RAID-Z, this will increase
your metadata size by (N-1) * 32 bytes. Would this not be an undesirable
cost increase in the metadata size?

> In more detail: allow the pointer to the block to contain one
> checksum per device used (the count will vary if you're using a
> RAID-Z style algorithm). Checksum each device's data independently,
> so the pointer looks like one of:
>
>   (a) array of <device, offset, checksum> tuples
>   (b) <device, offset> tuple and array of checksums

The former is really nice, as it also allows you to place the block
arbitrarily within a disk. This could potentially allow a more efficient
implementation of RAID-Z with variable-sized disks (some disks could
effectively be concatenated as in a RAID-0, reducing the overall stripe
width).

> Writing a partial stripe is more interesting. With style (b) block
> pointers -- RAID-Z style -- you need to do a read/modify/write of
> the stripe to get all the new data into the right place. But with
> style (a) block pointers, you're back to RAID-5 style writes (read
> old data+parity or remaining data, write new data+parity). (You
> don't need to rewrite the whole stripe since the block pointer can
> refer to the original partial stripe.)

Indeed! Out of interest, what are the major concerns with increasing the
block pointer size from 128 bytes?

James
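As a rough sense of the scale involved, taking 32 bytes as one 256-bit
checksum and 128 bytes as the current block pointer size (the figures
mentioned above):

  5-disk raid-z:   (5 - 1)  * 32 bytes = 128 extra bytes per block pointer
  11-disk raid-z:  (11 - 1) * 32 bytes = 320 extra bytes per block pointer

  i.e. a 128-byte block pointer grows to roughly 256-448 bytes, a 2x to
  3.5x increase, before any other per-column bookkeeping.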
Hi Robert,

Out of curiosity, would it be possible to see the same test but hitting
the disks with write operations instead of reads?

Best Regards,
Jason

On 11/2/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello zfs-discuss,
>
> Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks
>
> filebench/randomread script, filesize=256GB
>
> [full benchmark results snipped]
Hello Robert,

Thursday, November 2, 2006, 5:12:37 PM, you wrote:

RM> Hello zfs-discuss,

RM> Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks

RM> filebench/randomread script, filesize=256GB

RM> 2 disks for system, 2 disks as hot spares, atime set to off for the
RM> pool, cache_bshift set to 8K (2^13), recordsize untouched (default).

RM> pool: 4x raid-z (5 disks) + 4x raid-z (6 disks) means that the pool
RM> was created with 4 raid-z1 groups of 5 disks each and another 4
RM> raid-z1 groups of 6 disks each.

RM> 1. pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)
RM>    (36 disks of usable space)

RM>    a. nthreads = 1     ~60 ops
RM>    b. nthreads = 4    ~250 ops
RM>    c. nthreads = 8    ~520 ops
RM>    d. nthreads = 128 ~1340 ops

RM>    1340/8 = 167 ops

Now the same pool config, but the actual RAID-5 is done using SVM and
ZFS just stripes across the SVM RAID-5 devices.

With nthreads=128 I get ~3680 ops, which is almost 3x as much as with
raid-z.

I don't like this config, but maybe it's a better way to go than raid-z
after all - at least for some environments.

ps. however, creating a large file is about 4x slower than on raid-z

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
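For readers unfamiliar with that layout, it could be put together
roughly as follows. The metadevice names, disk slices, and interlace
size are hypothetical examples, not necessarily what was used in the
test above.

  # Create several SVM RAID-5 metadevices (-r = RAID-5, -i = interlace)
  metainit d101 -r c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0 -i 32k
  metainit d102 -r c0t5d0s0 c0t6d0s0 c0t7d0s0 c1t0d0s0 c1t1d0s0 -i 32k
  metainit d103 -r c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 c1t6d0s0 -i 32k
  ...

  # Let ZFS stripe dynamically across the RAID-5 metadevices
  zpool create tank /dev/md/dsk/d101 /dev/md/dsk/d102 /dev/md/dsk/d103 ...

The write-speed observation fits this picture: SVM RAID-5 presumably
pays the classic read-modify-write parity penalty on writes, whereas
raid-z always writes full stripes, which is why large file creation is
slower here even though random reads no longer touch every column.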
Hello Robert,

Sunday, November 5, 2006, 2:12:58 AM, you wrote:

RM> Hello Robert,

RM> Thursday, November 2, 2006, 5:12:37 PM, you wrote:

RM>> Hello zfs-discuss,

RM>> Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks

RM>> filebench/randomread script, filesize=256GB

RM>> 2 disks for system, 2 disks as hot spares, atime set to off for the
RM>> pool, cache_bshift set to 8K (2^13), recordsize untouched (default).

RM>> pool: 4x raid-z (5 disks) + 4x raid-z (6 disks) means that the pool
RM>> was created with 4 raid-z1 groups of 5 disks each and another 4
RM>> raid-z1 groups of 6 disks each.

RM>> 1. pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)
RM>>    (36 disks of usable space)

RM>>    a. nthreads = 1     ~60 ops
RM>>    b. nthreads = 4    ~250 ops
RM>>    c. nthreads = 8    ~520 ops
RM>>    d. nthreads = 128 ~1340 ops

RM>>    1340/8 = 167 ops

RM> Now the same pool config, but the actual RAID-5 is done using SVM and
RM> ZFS just stripes across the SVM RAID-5 devices.

RM> With nthreads=128 I get ~3680 ops

It's actually ~4100 ops.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On Sun, Nov 05, 2006 at 02:12:58AM +0100, Robert Milkowski wrote:
> Hello Robert,
>
> Thursday, November 2, 2006, 5:12:37 PM, you wrote:
>
> RM> Hello zfs-discuss,
>
> RM> Server: x4500, 2x Opteron 285 (dual-core), 16GB RAM, 48x 500GB disks
>
> RM> filebench/randomread script, filesize=256GB
>
> RM> 2 disks for system, 2 disks as hot spares, atime set to off for the
> RM> pool, cache_bshift set to 8K (2^13), recordsize untouched (default).
>
> RM> pool: 4x raid-z (5 disks) + 4x raid-z (6 disks) means that the pool
> RM> was created with 4 raid-z1 groups of 5 disks each and another 4
> RM> raid-z1 groups of 6 disks each.
>
> RM> 1. pool: 4x raid-z (5 disks) + 4x raid-z (6 disks)
> RM>    (36 disks of usable space)
>
> RM>    a. nthreads = 1     ~60 ops
> RM>    b. nthreads = 4    ~250 ops
> RM>    c. nthreads = 8    ~520 ops
> RM>    d. nthreads = 128 ~1340 ops
>
> RM>    1340/8 = 167 ops
>
> Now the same pool config, but the actual RAID-5 is done using SVM and
> ZFS just stripes across the SVM RAID-5 devices.
>
> With nthreads=128 I get ~3680 ops, which is almost 3x as much as with
> raid-z.
>
> I don't like this config, but maybe it's a better way to go than raid-z
> after all - at least for some environments.
>
> ps. however, creating a large file is about 4x slower than on raid-z

In my opinion RAID-Z is closer to RAID-3 than to RAID-5. In RAID-3 you
do only full-stripe writes/reads, which is also the case for RAID-Z.

What I found while working on the RAID-3 implementation for FreeBSD was
that for small RAID-3 arrays there is a way to speed up random reads by
up to 40%, by using the parity component in a round-robin fashion. For
example (DiskP stands for the parity component):

  Disk0 Disk1 Disk2 Disk3 DiskP

And now when I get a read request I do:

  Request number  Components
  0               Disk0+Disk1+Disk2+Disk3
  1               Disk1+Disk2+Disk3+(Disk1^Disk2^Disk3^DiskP)
  2               Disk2+Disk3+(Disk2^Disk3^DiskP^Disk0)+Disk0
  3               Disk3+(Disk3^DiskP^Disk0^Disk1)+Disk0+Disk1
  etc.

  + - concatenation
  ^ - XOR

In other words, for every read request a different component is skipped.

It was still a bit slower than RAID-5, though. And of course writes in
RAID-3 (and probably in RAID-Z) are much, much faster.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
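A minimal sketch of that skip-one-component read in C, assuming a single
parity disk. The disk_read() helper and all names here are made up for
illustration; this is not code from the FreeBSD geom_raid3 implementation.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: read one column of 'stripe' from one disk. */
    extern void disk_read(int disk, uint64_t stripe, uint8_t *buf, size_t len);

    /*
     * Read one full stripe. Data disks are 0..ndata-1, disk 'ndata' is
     * parity. Request 0 skips nothing; request k > 0 skips data disk
     * (k-1) and rebuilds its column by XORing parity with the other
     * columns, so the load rotates evenly over all ndata+1 disks.
     * Assumes collen <= 4096.
     */
    void
    raid3_read_stripe(uint64_t reqno, uint64_t stripe, int ndata,
        uint8_t cols[][4096], size_t collen)
    {
            int skip = (int)(reqno % (uint64_t)(ndata + 1)) - 1;
            uint8_t parity[4096];

            for (int d = 0; d < ndata; d++) {
                    if (d != skip)
                            disk_read(d, stripe, cols[d], collen);
            }

            if (skip >= 0) {
                    /* Rebuild the skipped column from parity + the rest. */
                    disk_read(ndata, stripe, parity, collen);
                    for (size_t i = 0; i < collen; i++) {
                            uint8_t x = parity[i];
                            for (int d = 0; d < ndata; d++) {
                                    if (d != skip)
                                            x ^= cols[d][i];
                            }
                            cols[skip][i] = x;
                    }
            }
    }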
I don't think you'd see the same performance benefits on RAID-Z, since
parity isn't always on the same disk. Are you seeing hot/cool disks?

Adam

On Sun, Nov 05, 2006 at 04:03:18PM +0100, Pawel Jakub Dawidek wrote:
> In my opinion RAID-Z is closer to RAID-3 than to RAID-5. In RAID-3 you
> do only full-stripe writes/reads, which is also the case for RAID-Z.
>
> What I found while working on the RAID-3 implementation for FreeBSD was
> that for small RAID-3 arrays there is a way to speed up random reads by
> up to 40%, by using the parity component in a round-robin fashion. For
> example (DiskP stands for the parity component):
>
>   Disk0 Disk1 Disk2 Disk3 DiskP
>
> And now when I get a read request I do:
>
>   Request number  Components
>   0               Disk0+Disk1+Disk2+Disk3
>   1               Disk1+Disk2+Disk3+(Disk1^Disk2^Disk3^DiskP)
>   2               Disk2+Disk3+(Disk2^Disk3^DiskP^Disk0)+Disk0
>   3               Disk3+(Disk3^DiskP^Disk0^Disk1)+Disk0+Disk1
>   etc.
>
>   + - concatenation
>   ^ - XOR
>
> In other words, for every read request a different component is skipped.
>
> It was still a bit slower than RAID-5, though. And of course writes in
> RAID-3 (and probably in RAID-Z) are much, much faster.
>
> --
> Pawel Jakub Dawidek                       http://www.wheel.pl
> pjd at FreeBSD.org                           http://www.FreeBSD.org
> FreeBSD committer                         Am I Evil? Yes, I Am!

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
> I don't think you'd see the same performance benefits on RAID-Z, since
> parity isn't always on the same disk. Are you seeing hot/cool disks?

In addition, doesn't it always have to read all columns so that the
parity can be validated?

--
Darren Dunham                                   ddunham at taos.com
Senior Technical Consultant     TAOS            http://www.taos.com/
Got some Dr Pepper?                             San Francisco, CA bay area
          < This line left intentionally blank to confuse you. >
On 09 November, 2006 - Darren Dunham sent me these 0,7K bytes:

> > I don't think you'd see the same performance benefits on RAID-Z, since
> > parity isn't always on the same disk. Are you seeing hot/cool disks?
>
> In addition, doesn't it always have to read all columns so that the
> parity can be validated?

If the block checksum is OK, then the parity is OK too.. I think?
(assuming checksum=on)

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Hello Richard,

Thursday, November 2, 2006, 8:09:44 PM, you wrote:

REP> I've added my RAIDoptimizer predictions below...

RAIDoptimizer - what is it exactly and where can I get it? :)

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hello Jason,

Friday, November 3, 2006, 4:54:04 AM, you wrote:

JJWW> Hi Robert,

JJWW> Out of curiosity, would it be possible to see the same test but
JJWW> hitting the disks with write operations instead of reads?

Sorry... I really wanted to, and I'm afraid I won't find enough time
before my vacation in a few days. I'll try when I come back.

ps. however, we did test today with 4x 11-disk raid-z2 groups in one
zpool on an actual production workload: many clients writing lots of
small files over NFS. The Thumper could easily receive up to 65MB/s over
the net - it could probably take more; I suspect the clients simply
weren't generating more. Compared to our experience with HW RAID-5 it's
much better. I guess that as long as there is enough free space in the
pool, writes from all those clients will stay quite fast.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com