Jonathan Wheeler
2006-Jul-17 12:34 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Hi All,

I've just built an 8 disk ZFS storage box, and I'm in the testing phase before I put it into production. I've run into some unusual results, and I was hoping the community could offer some suggestions. I've basically made the switch to Solaris on the promise of ZFS alone (yes, I'm that excited about it!), so naturally I'm looking forward to some great performance - but it appears I'm going to need some help finding all of it.

I was getting even lower numbers with filebench, so I decided to dial back to a really simple app for testing - bonnie.

The system is a nevada_41 EM64T 3GHz Xeon, 1GB RAM, with 8x Seagate SATA II 300GB disks on a Supermicro SAT2-MV8 8-port SATA controller, running on a 133MHz 64-bit PCI-X bus.
The bottleneck here, by my thinking, should be the disks themselves. It's not the disk interfaces ('300MB/s'), the disk bus (300MB/s each), or the PCI-X bus (1.1GB/s), and I'd hope a 64-bit 3GHz CPU would be sufficient.

Tests were run on a fresh, clean zpool on an idle system. Rogue results were dropped, and as you can see below, all tests were run more than once. The 8GB test size should be far more than the 1GB of RAM that the system has, eliminating caching issues.

If I've still managed to overlook something in my testing setup, please let me know - I sure did try!

Sorry about the formatting - this is bound to end up ugly.

Bonnie
            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raid0   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk 8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0 2.0
8 disk 8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9 2.1

so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be LOWER than writes??

            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror  MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk 8196 33285 38.6  46033  9.9  33077  6.8 67934 90.4  93445  7.7 230.5 1.3
8 disk 8196 34821 41.4  46136  9.0  32445  6.6 67120 89.1  94403  6.9 210.4 1.8

46MB/sec writes: each disk individually can do better, but I guess keeping 8 disks in sync is hurting performance. The 94MB/sec read figure is interesting. On the one hand, that's greater than 1 disk's worth, so I'm getting striping performance out of a mirror - GO ZFS. On the other, if I can get striping performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not CPU bound.
Now for the important test, raid-z

            -------Sequential Output-------- ---Sequential Input-- --Random--
            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raidz   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk 8196 61785 70.9 142797 29.3  89342 19.9 64197 85.7 320554 32.6 131.3 1.0
8 disk 8196 62869 72.4 131801 26.7  90692 20.7 63986 85.7 306152 33.4 127.3 1.0
8 disk 8196 63103 72.9 128164 25.9  86175 19.4 64126 85.7 320410 32.7 124.5 0.9
7 disk 8196 51103 58.8  93815 19.1  74093 16.1 64705 86.5 331865 32.8 124.9 1.0
7 disk 8196 49446 56.8  93946 18.7  73092 15.8 64708 86.7 331458 32.7 127.1 1.0
7 disk 8196 49831 57.1  81305 16.2  78101 16.9 64698 86.4 331577 32.7 132.4 1.0
6 disk 8196 62360 72.3 157280 33.4  99511 21.9 65360 87.3 288159 27.1 132.7 0.9
6 disk 8196 63291 72.8 152598 29.1  97085 21.4 65546 87.2 292923 26.7 133.4 0.8
4 disk 8196 57965 67.9 123268 27.6  78712 17.1 66635 89.3 189482 15.9 134.1 0.9

I'm getting distinctly non-linear scaling here.

Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = 33MB/sec per disk with CPU to spare (roughly half of what each individual disk should be capable of). Here I'm getting 123/4 = 30MB/sec, or should that be 123/3 = 41MB/sec?
Using 30 as a baseline, I'd be expecting to see twice that with 8 disks (240ish?). What I end up with is ~135. Clearly not good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in all tests. I ran it 3x to be sure. Note this was a native 7 disk raid-z; it wasn't 8 running in degraded mode with 7. Something is really wrong with my write performance here, across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks should scale to 380 then; well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are good for 60+MB/sec individually. 290 is 48/disk - note also that this is better than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? Something is going very wrong here too.
The 7 disk raidz read test is about what I'd expect (330/7 = 47/disk), but it shows that the 8 disk setup is actually going backwards. Hmm...

I understand that going for an 8 disk wide raidz isn't optimal in terms of redundancy and IOPS - but my workload shouldn't involve large amounts of sustained random IO, so I'm happy to take the loss in favour of absolute capacity. My issue here is the scaling on sequential block transfers, not optimal design.

All three raid levels have had unexpected results, and I'd really appreciate some suggestions on how I can troubleshoot this. I know how to run iostat while bonnie is running, but that's about it. Incidentally, iostat is telling me that the disks are at best hitting around 70% busy (%b). With the 8 disk tests, it was often below 50%...

Is my issue perhaps with the SATA card that I'm using? Maybe it's just not able to handle that much throughput, despite being advertised to do so. With Raid0 (aka dynamic stripes), I know that each disk can read at 60-70MB/sec. Why am I not getting 65*8 (500MB/sec+) performance? Maybe it's the marvell driver at fault here?

My thinking is that I need to get raid0 performing as expected before looking at raidz, but I'm afraid I really don't know where to begin.

All thoughts & suggestions welcome. I'm not using the disks yet, so I can blow the zpool away as needed.

Many thanks,
Jonathan Wheeler


This message posted from opensolaris.org
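For what it's worth, a minimal way to watch what the individual disks are doing while bonnie runs - a sketch only, with the pool mountpoint and file size as placeholders:

  # in one terminal, run the benchmark against the pool's mountpoint
  bonnie -d /pool -s 8196

  # in another, sample extended per-device statistics every 5 seconds
  iostat -xn 5

The per-device kr/s, kw/s, actv and %b columns show whether the disks are saturated or idling during each phase, which helps separate a disk bottleneck from a controller or driver one.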
Dana H. Myers
2006-Jul-17 15:47 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Jonathan Wheeler wrote:

I'm not a ZFS expert - I'm just an enthusiastic user inside Sun. Here are some brief observations:

> Bonnie
>             -------Sequential Output-------- ---Sequential Input-- --Random--
>             -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> raid0   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 8 disk 8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0 2.0
> 8 disk 8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9 2.1
>
> so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be LOWER than writes??

I believe this can happen because ZFS is optimized for writes, though I would tend to expect a sequential write followed by a sequential read to be about the same if there's no other filesystem activity during the write.

>             -------Sequential Output-------- ---Sequential Input-- --Random--
>             -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> mirror  MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 8 disk 8196 33285 38.6  46033  9.9  33077  6.8 67934 90.4  93445  7.7 230.5 1.3
> 8 disk 8196 34821 41.4  46136  9.0  32445  6.6 67120 89.1  94403  6.9 210.4 1.8
>
> 46MB/sec writes: each disk individually can do better, but I guess keeping 8 disks in sync is hurting performance. The 94MB/sec read figure is interesting. On the one hand, that's greater than 1 disk's worth, so I'm getting striping performance out of a mirror - GO ZFS. On the other, if I can get striping performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not CPU bound.

I expect a mirror to perform about the same as a single disk for writes, and about the same as two disks for reads, which seems to be the case here. Someone from the ZFS team can correct me, but I tend to believe that reads from a mirror are scheduled in pairs; it doesn't help the read performance to have 6 more copies of the same data available.

> Now for the important test, raid-z

I'll have to let the experts dissect this data; it looks a little goofy to me, too.

Dana
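One way to check how ZFS is actually spreading the mirrored reads across devices is to sample the pool's own statistics during the read phase - a sketch, with 'tank' as a placeholder pool name:

  # per-vdev bandwidth and operations, sampled every 5 seconds
  zpool iostat -v tank 5

If only a couple of the mirror's eight sides are being read from at any moment, the per-disk read columns will show it directly.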
Roch
2006-Jul-17 16:00 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Sorry to plug my own blog, but have you had a look at these?

http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

Also, my thinking is that raid-z is probably more friendly when the config contains (power-of-2 + 1) disks (or + 2 for raid-z2).

-r
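In practice that guideline maps to raid-z groups whose data-disk count is a power of two - for example (a sketch; device names are placeholders):

  # 4+1 raid-z (4 data disks + 1 parity)
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0

  # 2+1 raid-z (2 data disks + 1 parity)
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0

The usual reasoning is that a 128K record then divides evenly across the data columns, so no disk ends up with odd-sized partial writes.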
Richard Elling
2006-Jul-17 16:43 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Dana H. Myers wrote:
> Jonathan Wheeler wrote:
>>             -------Sequential Output-------- ---Sequential Input-- --Random--
>>             -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>> mirror  MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>> 8 disk 8196 33285 38.6  46033  9.9  33077  6.8 67934 90.4  93445  7.7 230.5 1.3
>> 8 disk 8196 34821 41.4  46136  9.0  32445  6.6 67120 89.1  94403  6.9 210.4 1.8
>>
>> 46MB/sec writes: each disk individually can do better, but I guess keeping 8 disks in sync is hurting performance. The 94MB/sec read figure is interesting. On the one hand, that's greater than 1 disk's worth, so I'm getting striping performance out of a mirror - GO ZFS. On the other, if I can get striping performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not CPU bound.
>
> I expect a mirror to perform about the same as a single disk for writes, and about
> the same as two disks for reads, which seems to be the case here. Someone from
> the ZFS team can correct me, but I tend to believe that reads from a mirror are
> scheduled in pairs; it doesn't help the read performance to have 6 more copies of
> the same data available.

Is this an 8-way mirror, or a 4x2 RAID-1+0? For the former, I agree with Dana.
For the latter, you should get more available space and better performance.

8-way mirror:
  zpool create blah mirror c1d0 c1d1 c1d2 c1d3 c1d4 c1d5 c1d6 c1d7

4x2-way mirror:
  zpool create blag mirror c1d0 c1d1 mirror c1d2 c1d3 mirror c1d4 c1d5 mirror c1d6 c1d7

-- richard
Al Hopper
2006-Jul-17 17:00 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
On Mon, 17 Jul 2006, Roch wrote:
>
> Sorry to plug my own blog, but have you had a look at these?
>
> http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
> http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs
>
> Also, my thinking is that raid-z is probably more friendly
> when the config contains (power-of-2 + 1) disks (or + 2 for
> raid-z2).

+1  I think that 5 disks for a raidz is the sweet spot IMHO. But ... YMMV etc. etc.

FWIW: here's a datapoint from a dirty raidz system with 8GB of RAM & 5 * 300GB SATA disks:

Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfs0           16G 88937  99 195973 47 95536  29 75279  95 228022 27 433.9   1
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 31812  99 +++++ +++ +++++ +++ 28761  99 +++++ +++ +++++ +++
zfs0,16G,88937,99,195973,47,95536,29,75279,95,228022,27,433.9,1,16,31812,99,+++++,+++,+++++,+++,28761,99,+++++,+++,+++++,+++

I'm *very* pleased with the current release of ZFS. That being said, ZFS can be frustrating at times. Occasionally it'll issue in excess of 1k I/O ops a second (IOPS) and you'll say "holy snit, look at..." - and then there are times you wonder why it won't issue more than ~250 IOPS. But, for a Rev 1 filesystem, with the technical complexity of ZFS, this level of performance is excellent IMHO, and I expect that all kinds of improvements will continue to be made to the code over time.

Jonathan - I expect the answer to your performance expectations is that ZFS is-what-it-is at the moment. A suggestion is to split your 8 drives into a 5 disk raidz pool and a 2 disk mirror, with one spare drive remaining. Of course this is from my ZFS experience and for my intended usage, and may not apply to your intended application(s).

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
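For reference, that suggested split would look roughly like this (a sketch; device names are placeholders, and the hot-spare syntax assumes a build that supports spares):

  # 5 disk raid-z pool for bulk capacity
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0

  # 2 disk mirror for workloads that need better random I/O
  zpool create fast mirror c0t5d0 c0t6d0

  # optionally attach the remaining disk as a hot spare, where supported
  zpool add tank spare c0t7d0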
James Dickens
2006-Jul-17 19:03 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
On 7/17/06, Jonathan Wheeler <griffous at griffous.net> wrote:
<snip>
> Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks should scale to 380 then; well, 320 isn't all that far off - no biggie.
> Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are good for 60+MB/sec individually. 290 is 48/disk - note also that this is better than my raid0 performance?!
> Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? Something is going very wrong here too.

I'm not an expert, but it would be great if you could run at least one more test.

Can you try 2x 4 disks in a raidz pool, to see if the system does scale to 380MB/s or whether the cpu gets in the way?

If another controller card is available, it would be interesting to see what effect, if any, there is from splitting the 8 drives across 2 controllers (4 drives per controller), to see if you get any performance change.

I wonder if more cpus/cores would help this test; theoretically the single cpu is fast enough, but with checksumming, creating parity, reading and writing, and the benchmark itself, you may get some strange interaction.

You may want to try again in a few weeks - I heard that a change went into the kernel that makes SATA access more efficient.

James Dickens
uadmin.blogspot.com
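On the CPU question, a quick way to see whether the processor rather than the disks is the limit during a run - a sketch:

  # per-CPU utilization sampled every 5 seconds while bonnie runs
  mpstat 5

  # or a single system-wide summary line per interval
  vmstat 5

If idle time stays well above zero while throughput has flattened out, the bottleneck is more likely in the I/O path than in checksum or parity computation.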
Jonathan Wheeler
2006-Jul-18 08:56 UTC
[zfs-discuss] Re: ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Richard Elling wrote:
> Dana H. Myers wrote:
>> Jonathan Wheeler wrote:
<snip>
>>> On the one hand, that's greater than 1 disk's worth, so I'm getting striping performance out of a mirror - GO ZFS. On the other, if I can get striping performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not CPU bound.
>>
>> I expect a mirror to perform about the same as a single disk for writes, and about
>> the same as two disks for reads, which seems to be the case here. Someone from
>> the ZFS team can correct me, but I tend to believe that reads from a mirror are
>> scheduled in pairs; it doesn't help the read performance to have 6 more copies of
>> the same data available.

Makes sense, thanks Dana.

> Is this an 8-way mirror, or a 4x2 RAID-1+0? For the former, I agree with Dana.

Yup, a full 8 way mirror.

> For the latter, you should get more available space and better performance.
> 8-way mirror:
>   zpool create blah mirror c1d0 c1d1 c1d2 c1d3 c1d4 c1d5 c1d6 c1d7
> 4x2-way mirror:
>   zpool create blag mirror c1d0 c1d1 mirror c1d2 c1d3 mirror c1d4 c1d5 mirror c1d6 c1d7

I agree, it would be a win both ways. Though in my own defence, I never intended to run a full 8 way mirror for actual use - it was just a fun test to see what would happen, and I hoped the results might help point towards a bottleneck that wasn't so obvious with the other raid levels.

Thanks,
Jonathan Wheeler


This message posted from opensolaris.org
Jonathan Wheeler
2006-Jul-18 13:19 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
> On Mon, 17 Jul 2006, Roch wrote:
>>
>> Sorry to plug my own blog, but have you had a look at these?
>>
>> http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
>> http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs
>>
>> Also, my thinking is that raid-z is probably more friendly
>> when the config contains (power-of-2 + 1) disks (or + 2 for
>> raid-z2).

Yes I did, and please, plug away!! These are awesome blog entries, and I've read both of them several times. You rule! Really. I wish I could understand a bit more of your second one; it's a bit over my head, I'm afraid.

I understand that 8 disks is not optimal for a raidz set, especially for random input, and your blog entry is the reason for my comment to that effect near the bottom of my first post. My latest raid 50 results were much more healthy, but I don't know that I'm ready to sacrifice 300GB of storage for that slight improvement - especially as zfs can't grow the individual stripes (yet...).

> I think that 5 disks for a raidz is the sweet spot IMHO. But ... YMMV etc. etc.
>
> FWIW: here's a datapoint from a dirty raidz system with 8GB of RAM & 5 * 300GB SATA disks:
>
> Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> zfs0           16G 88937  99 195973 47 95536  29 75279  95 228022 27 433.9   1
>                    ------Sequential Create------ --------Random Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16 31812  99 +++++ +++ +++++ +++ 28761  99 +++++ +++ +++++ +++
> zfs0,16G,88937,99,195973,47,95536,29,75279,95,228022,27,433.9,1,16,31812,99,+++++,+++,+++++,+++,28761,99,+++++,+++,+++++,+++

Here is my version with 5 disks in a single raidz:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
5 disks 16384 62466 72.5 133768 28.0  97698 21.7 66504 88.1 241481 20.7 118.2 1.4

Ouch, yours is much better! Can you tell me more about your setup?

> I'm *very* pleased with the current release of ZFS. That being said, ZFS can be frustrating at times. Occasionally it'll issue in excess of 1k I/O ops a second (IOPS) and you'll say "holy snit, look at..." - and then there are times you wonder why it won't issue more than ~250 IOPS. But, for a Rev 1 filesystem, with the technical complexity of ZFS, this level of performance is excellent IMHO, and I expect that all kinds of improvements will continue to be made to the code over time.

I don't really have a point of comparison to know how well my hardware should be performing in the real world, just a gut feeling that it should be doing better, and some rather odd scaling issues. Please don't take this as zfs bashing. I still can't stop telling everyone I know about how I can create a 2TB raid in 3 seconds - I think ZFS is wicked cool!

This thread is two fold:
1) I'm hoping to learn more about zfs & solaris performance tuning by digging on in and investigating.
2) I have some notion of hopefully being helpful by providing developers with some real world data that might help in improving the code. I'm more than happy to do any testing that anyone can throw at me.
I've already had one email from one person asking me to run their dtrace script while benchmarking and email back the results. This is great. I can't code, but if I can help give back in any way here - hurrah!

> Jonathan - I expect the answer to your performance expectations is that ZFS is-what-it-is at the moment.

Along those lines, I'll upgrade to the latest nevada as soon as my connection finishes downloading it. 5 CDs is very non-trivial down in this part of the world, sadly.

> Regards,
>
> Al Hopper  Logical Approach Inc, Plano, TX.

Thanks for the reply Al,

Jonathan Wheeler
Jonathan Wheeler
2006-Jul-18 13:20 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
> On 7/17/06, Jonathan Wheeler <griffous at griffous.net> wrote:
<SNIP>
>> Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks should scale to 380 then; well, 320 isn't all that far off - no biggie.
>> Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are good for 60+MB/sec individually. 290 is 48/disk - note also that this is better than my raid0 performance?!
>> Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? Something is going very wrong here too.
>
> I'm not an expert, but it would be great if you could run at least one more test.
>
> Can you try 2x 4 disks in a raidz pool, to see if the system does scale to 380MB/s or whether the cpu gets in the way?

Sure thing! Like this?

# zpool create Z raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 raidz c0t4d0 c0t5d0 c0t6d0 c0t7d0
# zpool status
  pool: Z
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Z           ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0

errors: No known data errors

Which gave me:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine   MB  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
raid50  8196  66551 76.0 107030 21.8 112764 25.9 67629 90.0 327566 35.4 195.6 1.4
raid50  8196  64689 74.6 173965 33.4 107816 23.8 67586 89.9 347390 34.6 204.6 1.3
raid50  8196  64235 74.3 159652 32.0 105808 23.9 67679 90.1 114283 10.0 201.0 1.8
raid50  8196  64348 74.5 171047 34.0 104658 23.3 67422 89.5 353125 34.2 204.2 1.8
raid50  8196  66034 75.6 170279 33.4 109865 24.2 67413 89.6 345114 34.8 197.7 1.3

As you can see, I ran it 5 times. ZFS seemed to be taking a bit of a nap (throttling?) on runs 1 & 3, but the trend is definitely better than a single raid-z group, across the board.

My understanding of dynamic stripe sizes still needs work, so please correct me if I have this wrong: it's interesting to me that even though there are only 6 disks to use for writes (rather than 7), and the cpu is having to do twice the parity calculations, the performance is still actually _better_ overall. The sequential reads are higher, and the random seeks are, not surprisingly, better.

James, does this rule out the cpu for you? These results are probably better interpreted by the experts, so please, speak up :). I read this as showing that my disk subsystem does have more in it than we were seeing from a single raid-z.

> If another controller card is available, it would be interesting to see what effect, if any, there is from splitting the 8 drives across 2 controllers (4 drives per controller), to see if you get any performance change.

Yes, I would love to test this. I've been suspicious that this might be a controller issue, though I was somewhat hoping that there would be some way that the OS could tell me this. I don't actually have a second card to test this properly :(
I do however have 2 motherboard sata ports. It wouldn't be a great test (6 on one, 2 on the other), but it may give some interesting results. I'll try this tomorrow.

> I wonder if more cpus/cores would help this test; theoretically the single cpu is fast enough, but with checksumming, creating parity, reading and writing, and the benchmark itself, you may get some strange interaction.

I don't have another cpu, sadly.
I could do some tests with checksumming turned off, though, if you'd like to see those results. If it turns out that this is sapping a large amount of cpu, it's a penalty I'm quite happy to pay given the benefits - I just want to know where the bottleneck lies.

> You may want to try again in a few weeks - I heard that a change went into the kernel that makes SATA access more efficient.

Well, that would be great. I heard something about NCQ not being implemented yet; perhaps this is what you are referring to?

> James Dickens
> uadmin.blogspot.com

Jonathan
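For reference, disabling checksums for such a test is a per-dataset property change, roughly along these lines (using the pool name 'Z' from the example above):

  # turn data checksumming off for the duration of the test
  zfs set checksum=off Z

  # and back on afterwards (the default)
  zfs set checksum=on Z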
Luke Lonergan
2006-Jul-18 16:45 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
The prefetch and I/O scheduling of nv41 were responsible for some quirky performance. First-time read performance might be good, then subsequent reads might be very poor.

With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off):

Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1   32G 120453 99 467814 98 290391 58 109371 99 993344 94  1801   4
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ 30850  99 +++++ +++ +++++ +++

Bumping the number of concurrent processes up to 2, we get about 1.5x the single-process read speed out of the RAID10 with a concurrent workload (you have to add the rates together):

Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1   32G 111441 95 212536 54 171798 51 106184 98 719472 88  1233   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26085  90 +++++ +++  5700  98 21448  97 +++++ +++  4381  97

Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1   32G 116355 99 212509 54 171647 50 106112 98 715030 87  1274   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26082  99 +++++ +++  5588  98 21399  88 +++++ +++  4272  97

So that's ~2500 seeks per second, ~1440MB/s sequential block read, and ~212MB/s per-character sequential read.

- Luke
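For anyone wanting to reproduce the two-process numbers, the concurrent runs can be launched along these lines (a sketch only; the paths and size are placeholders, and the flags should be checked against the local bonnie++ man page):

  # two bonnie++ instances against separate directories in the same pool
  bonnie++ -u root -d /tank/bench1 -s 32g &
  bonnie++ -u root -d /tank/bench2 -s 32g &
  wait

As noted above, the per-process rates then have to be added together to get the aggregate throughput.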
Eric Schrock
2006-Jul-18 17:16 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
Also note that the current prefetching (even in snv_41) still suffers from some major systematic performance problems. This should be fixed by snv_45/s10u3, and is covered by the following bug:

6447377 ZFS prefetch is inconsistant

I'll duplicate Mark's evaluation here, since it doesn't show up on bugs.opensolaris.org:

---------
The main problem here is that the dmu is not informing the zfetch code about all IOs. The zfetch interface is only called when the dmu needs to go to the ARC to resolve an IO request. If the dmu finds that the buffer is already cached (in the dmu) it does not bother to call zfetch. So here's what can happen:

1 - ARC cache gets loaded up with some portion of a file 'X'
2 - application initiates a sequential read on 'X'
3 - DMU reads first 10 blocks from the file via arc_read()
4 - dmu_zfetch() detects sequential read pattern and starts prefetching
5 - DMU finds blocks 11-15 already cached (does not tell zfetch)
6 - DMU issues read for block 16
7 - dmu_zfetch() sees a gap in the read pattern, and so assumes that we are
    doing a *strided read*, and changes its prefetch algorithm:
    prefetch 10, skip 5, prefetch 10, ... etc.

As the dmu finds other blocks in its cache, the zfetch algorithms can become even more confused.
---------

With some additional fixes from Jeff, sequential read performance has been vastly improved. These fixes are undergoing final testing as we speak.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Tao Chen
2006-Jul-19 23:41 UTC
[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
On 7/17/06, Jonathan Wheeler <griffous at griffous.net> wrote:
<snip>

One major concern Jonathan has is the 7-disk raidz write performance. (I see no big surprise in the 'read' results.)

"The really interesting numbers happen at 7 disks - it's slower than with 4, in all tests."

I randomly picked 3 results from his several runs:

              -Per Char-  --Block---  -Rewrite--
          MB  K/sec %CPU  K/sec %CPU  K/sec %CPU
4-disk  8196  57965 67.9 123268 27.6  78712 17.1
7-disk  8196  49454 57.1  92149 20.1  73013 16.0
8-disk  8196  61345 70.7 139259 28.5  89545 20.8

I looked at the corresponding dtrace data for the 7-disk and 8-disk raidz cases. (I should have also asked for 4-disk raidz data. Jonathan, you can still send the 4-disk data to me offline.)

In the 7-disk raidz, each disk had writes in two sizes, 214 blocks or 85 blocks, split equally:

DEVICE   BLKs    COUNT
-------- ----  -------
sd1        85    27855
          214    27882
sd2        85    27854
          214    27868
sd3        85    27849
          214    27884
...

In the 8-disk raidz, sd1, sd3, sd5 and sd7 had either 220- or 221-block writes, split equally, while sd2, sd4, sd6 and sd8 saw 100% 146-block writes:

DEVICE   BLKs    COUNT
-------- ----  -------
sd1       220    16325
          221    16338
sd2       146    49001
sd3       220    16335
          221    16333
sd4       146    49005
sd5       220    16340
          221    16324
sd6       146    49001
sd7       220    16332
          221    16333
sd8       146    49009

In terms of average write response time, in the 7-disk raidz:

DEVICE    WRITE  AVG.ms
-------  ------  ------
sd1       63990   54.03
sd2       64000   53.65
sd3       63898   55.48
sd4       64190   54.14
sd5       64091   54.81
sd6       63967   57.83
sd7       64092   54.19

and in the 8-disk raidz:

DEVICE    WRITE  AVG.ms
-------  ------  ------
sd1       42276    6.64
sd2       58467   19.66
sd3       42287    6.24
sd4       55198   20.01
sd5       42285    6.64
sd6       58409   22.90
sd7       42235    6.88
sd8       54967   24.46

At the bdev level, the 8-disk raidz shows much better turnaround time than the 7-disk raidz, and within it disks 1, 3, 5, 7 (larger writes) do better than 2, 4, 6, 8 (smaller writes).

So the 8-disk raidz wins through larger writes and much better response time per write - but why these two differences? And why the disparity between the odd- and even-numbered disks within the 8-disk raidz?

Tao
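For anyone who wants to collect similar per-device write-size and completion-time data, the DTrace io provider can be used along these lines. This is only a rough sketch, not the exact script behind the numbers above:

#!/usr/sbin/dtrace -s

/* record the issue time of each write as it reaches a device */
io:::start
/!(args[0]->b_flags & B_READ)/
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
        /* distribution of write sizes, in 512-byte blocks, per device */
        @blks[args[1]->dev_statname] = quantize(args[0]->b_bcount / 512);
}

/* on completion, accumulate the average response time per device */
io:::done
/!(args[0]->b_flags & B_READ) && start[args[0]->b_edev, args[0]->b_blkno]/
{
        @avgms[args[1]->dev_statname] =
            avg((timestamp - start[args[0]->b_edev, args[0]->b_blkno]) / 1000000);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

Left running for the length of a bonnie pass, it prints a write-size distribution and an average write completion time for each sd device on exit.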