Gray Carper
2008-Oct-14 07:31 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey, all!

We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200 head node. In trying to discover optimal ZFS pool construction settings, we've run a number of iozone tests, so I thought I'd share them with you and see if you have any comments, suggestions, etc.

First, on a single Thumper, we ran baseline tests on the direct-attached storage (which is collected into a single ZFS pool comprised of four raidz2 groups)...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 123919
Rewrite: 146277
Read: 383226
Reread: 383567
Random Read: 84369
Random Write: 121617

[8GB file size, 512KB record size]
Command:
Write: 373345
Rewrite: 665847
Read: 2261103
Reread: 2175696
Random Read: 2239877
Random Write: 666769

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 517092
Rewrite: 541768
Read: 682713
Reread: 697875
Random Read: 89362
Random Write: 488944

These results look very nice, though you'll notice that the random read numbers tend to be pretty low on the 1GB and 64GB tests (relative to their sequential counterparts), while the 8GB random (and sequential) read is unbelievably good.

Now we move to the head node's iSCSI aggregate ZFS pool...
[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
Write: 127108
Rewrite: 120704
Read: 394073
Reread: 396607
Random Read: 63820
Random Write: 5907

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 235348
Rewrite: 179740
Read: 577315
Reread: 662253
Random Read: 249853
Random Write: 274589

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
Write: 190535
Rewrite: 194738
Read: 297605
Reread: 314829
Random Read: 93102
Random Write: 175688

Generally speaking, the results look good, but you'll notice that random writes are atrocious on the 1GB tests and random reads are not so great on the 1GB and 64GB tests, while the 8GB test looks great across the board. Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and raidz2 modes - there were no significant changes in the results.

So, how concerned should we be about the low scores here and there? Any suggestions on how to improve our configuration? And how excited should we be about the 8GB tests? ;>

Thanks so much for any input you have!
-Gray
---
University of Michigan
Medical School Information Services
--
This message posted from opensolaris.org
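For anyone wanting to reproduce this matrix, the runs above can be scripted roughly as follows. This is a sketch, not Gray's actual script: `-i 0` selects write/rewrite, `-i 1` read/reread, `-i 2` random read/write, and since the 8GB command line wasn't preserved in the post, its `-r 512k` here is inferred from the "[8GB file size, 512KB record size]" header.

```shell
#!/bin/sh
# Sketch of the iozone test matrix described above.
# -i 0: write/rewrite  -i 1: read/reread  -i 2: random read/write
TARGET=/data-das/perftest    # or /volumes/data-iscsi/perftest on the head node

iozone -i 0 -i 1 -i 2 -r 1k   -s 1g  -f "$TARGET/1gbtest"
iozone -i 0 -i 1 -i 2 -r 512k -s 8g  -f "$TARGET/8gbtest"   # record size inferred from the header
iozone -i 0 -i 1 -i 2 -r 1m   -s 64g -f "$TARGET/64gbtest"
```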
James C. McPherson
2008-Oct-14 12:10 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
> an x4200 head node. In trying to discover optimal ZFS pool construction
> settings, we've run a number of iozone tests, so I thought I'd share them
> with you and see if you have any comments, suggestions, etc.

[snip]

Which build are you running? Have you done any system or ZFS tuning?

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:30 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, James!

We're actually running NexentaStor v1.0.8, which is based on b85. We haven't done any tuning ourselves, but I suppose it is possible that Nexenta did. If there's something specific you have in mind, I'd be happy to look for it.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:10 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hey, all!
>>
>> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
>> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
>> an x4200 head node. In trying to discover optimal ZFS pool construction
>> settings, we've run a number of iozone tests, so I thought I'd share them
>> with you and see if you have any comments, suggestions, etc.
>
> [snip]
>
> Which build are you running? Have you done any system
> or ZFS tuning?
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081014/ef6f99ff/attachment.html>
James C. McPherson
2008-Oct-14 12:33 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hey there, James!
>
> We're actually running NexentaStor v1.0.8, which is based on b85. We
> haven't done any tuning ourselves, but I suppose it is possible that
> Nexenta did. If there's something specific you'd like me to look for,
> I'd be happy to.

Hi Gray,
So build 85.... that's getting a bit long in the tooth now.

I know there have been *lots* of ZFS, Marvell SATA and iSCSI fixes and enhancements since then which went into OpenSolaris. I know they're in Solaris Express and the updated binary distro form of os2008.05 - I just don't know whether Erast and the Nexenta clan have included them in what they are releasing as 1.0.8.

Erast - could you chime in here please? Unfortunately I've got no idea about Nexenta.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello again! (And hellos to Erast, who has been a huge help to me many, many times! :>)

As I understand it, Nexenta 1.1 should be released in a matter of weeks and it'll be based on build 101. We are waiting for that with bated breath, since it includes some very important Active Directory integration fixes, but this sounds like another reason to be excited about it. Maybe this is a discussion that should be tabled until we are able to upgrade?

-Gray

On Tue, Oct 14, 2008 at 8:33 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that
>> Nexenta did. If there's something specific you'd like me to look for,
>> I'd be happy to.
>
> Hi Gray,
> So build 85.... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
James C. McPherson
2008-Oct-14 12:47 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hello again! (And hellos to Erast, who has been a huge help to me many,
> many times! :>)
>
> As I understand it, Nexenta 1.1 should be released in a matter of weeks
> and it'll be based on build 101. We are waiting for that with bated
> breath, since it includes some very important Active Directory
> integration fixes, but this sounds like another reason to be excited
> about it. Maybe this is a discussion that should be tabled until we are
> able to upgrade?

Yup, I think that's probably the best thing. And thanks for passing on the info about the 1.1 release, I'll keep that in my back pocket :)

cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:59 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy! Sounds good. We'll upgrade to 1.1 (b101) as soon as it is released, re-run our battery of tests, and see where we stand.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:47 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hello again! (And hellos to Erast, who has been a huge help to me many,
>> many times! :>)
>>
>> As I understand it, Nexenta 1.1 should be released in a matter of weeks
>> and it'll be based on build 101. We are waiting for that with bated
>> breath, since it includes some very important Active Directory
>> integration fixes, but this sounds like another reason to be excited
>> about it. Maybe this is a discussion that should be tabled until we are
>> able to upgrade?
>
> Yup, I think that's probably the best thing. And thanks
> for passing on the info about the 1.1 release, I'll keep
> that in my back pocket :)
>
> cheers,
> James
>
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
Akhilesh Mritunjai
2008-Oct-14 14:05 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Just a random spectator here, but I think the artifacts you're seeing here are not due to file size, but rather due to record size. What is the ZFS record size?

On a personal note, I wouldn't do non-concurrent (?) benchmarks. They are at best useless and at worst misleading for ZFS.

- Akhilesh.
--
This message posted from opensolaris.org
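For what it's worth, iozone can generate the kind of concurrent load Akhilesh is describing via its throughput mode. A sketch (the thread count, sizes, and paths here are illustrative, not a recommendation):

```shell
# iozone throughput mode: four workers running write/rewrite, read/reread
# and random I/O concurrently. -F takes one scratch file per worker.
iozone -i 0 -i 1 -i 2 -t 4 -r 128k -s 1g \
    -F /volumes/data-iscsi/perftest/t1 /volumes/data-iscsi/perftest/t2 \
       /volumes/data-iscsi/perftest/t3 /volumes/data-iscsi/perftest/t4
```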
Bob Friesenhahn
2008-Oct-14 14:36 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, 14 Oct 2008, Gray Carper wrote:
>
> So, how concerned should we be about the low scores here and there?
> Any suggestions on how to improve our configuration? And how excited
> should we be about the 8GB tests? ;>

The level of concern should depend on how you expect your storage pool to actually be used. It seems that it should work great for bulk storage, but not to support a database, or ultra high-performance super-computing applications.

The good 8GB performance is due to successful ZFS ARC caching in RAM, and because the record size is reasonable given the ZFS block size and the buffering ability of the intermediate links. You might see somewhat better performance using a 256K record size.

It may take quite a while to fill 150TB up.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Erast Benson
2008-Oct-14 15:04 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
James, all serious ZFS bug fixes have been back-ported to b85, as well as the marvell and other sata drivers. Not everything is possible to back-port, of course, but I would say all critical things are there. This includes the ZFS ARC optimization patches, for example.

On Tue, 2008-10-14 at 22:33 +1000, James C. McPherson wrote:
> Gray Carper wrote:
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that
>> Nexenta did. If there's something specific you'd like me to look for,
>> I'd be happy to.
>
> Hi Gray,
> So build 85.... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Brent Jones
2008-Oct-14 17:29 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <gcarper at umich.edu> wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets
> over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200
> head node. In trying to discover optimal ZFS pool construction settings,
> we've run a number of iozone tests, so I thought I'd share them with you
> and see if you have any comments, suggestions, etc.
>
> First, on a single Thumper, we ran baseline tests on the direct-attached
> storage (which is collected into a single ZFS pool comprised of four
> raidz2 groups)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 123919
> Rewrite: 146277
> Read: 383226
> Reread: 383567
> Random Read: 84369
> Random Write: 121617
>
> [8GB file size, 512KB record size]
> Command:
> Write: 373345
> Rewrite: 665847
> Read: 2261103
> Reread: 2175696
> Random Read: 2239877
> Random Write: 666769
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 517092
> Rewrite: 541768
> Read: 682713
> Reread: 697875
> Random Read: 89362
> Random Write: 488944
>
> These results look very nice, though you'll notice that the random read
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
> sequential counterparts), but the 8GB random (and sequential) read is
> unbelievably good.
>
> Now we move to the head node's iSCSI aggregate ZFS pool...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
> Write: 127108
> Rewrite: 120704
> Read: 394073
> Reread: 396607
> Random Read: 63820
> Random Write: 5907
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
> Write: 235348
> Rewrite: 179740
> Read: 577315
> Reread: 662253
> Random Read: 249853
> Random Write: 274589
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
> Write: 190535
> Rewrite: 194738
> Read: 297605
> Reread: 314829
> Random Read: 93102
> Random Write: 175688
>
> Generally speaking, the results look good, but you'll notice that random
> writes are atrocious on the 1GB tests and random reads are not so great
> on the 1GB and 64GB tests, but the 8GB test looks great across the board.
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in
> disk, raidz1, and raidz2 modes - there were no significant changes in the
> results.
>
> So, how concerned should we be about the low scores here and there? Any
> suggestions on how to improve our configuration? And how excited should
> we be about the 8GB tests? ;>
>
> Thanks so much for any input you have!
> -Gray
> ---
> University of Michigan
> Medical School Information Services
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Your setup sounds very interesting, particularly how you export iSCSI to another head unit. Can you give me some more details on your file system layout, and how you mount it on the head unit? Sounds like a pretty clever way to export awesomely large volumes!

Regards,

--
Brent Jones
brent at servuhome.net
James C. McPherson
2008-Oct-14 21:52 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Erast Benson wrote:
> James, all serious ZFS bug fixes have been back-ported to b85, as well
> as the marvell and other sata drivers. Not everything is possible to
> back-port, of course, but I would say all critical things are there.
> This includes the ZFS ARC optimization patches, for example.

Excellent!

James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-15 04:51 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, Bob!

Looks like you and Akhilesh (thanks, Akhilesh!) are driving at a similar, very valid point. I'm currently using the default recordsize (128K) on all of the ZFS pools (those of the iSCSI target nodes and the aggregate pool on the head node).

I should've mentioned something about how the storage will be used in my original post, so I'm glad you brought it up. It will all be presented over NFS and CIFS as a 10GbE+Infiniband NAS which will serve a number of organizations. Some organizations will simply use their area for end-user file sharing, others will use it as a disk backup target, others for databases, and still others for HPC data crunching (gene sequences). Each of these uses will be on different filesystems, of course, so I expect it would be good to set different recordsize parameters for each one. Do you have any suggestions on good starting sizes for each? I'd imagine filesharing might benefit from a relatively small record size (64K?), image-based backup targets might like a pretty large record size (256K?), databases just need recordsizes to match their block sizes, and HPC... I have no idea. Heh. I expect I'll need to get in contact with the HPC lab to see what kind of profile they have (whether they deal with tiny files or big files, etc). What do you think?

Today I'm going to try a few non-ZFS-related tweaks (disabling the Nagle algorithm on the iSCSI initiator and increasing the MTU everywhere to 9000). I'll give those a shot and see if they yield performance enhancements.

-Gray

On Tue, Oct 14, 2008 at 10:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 14 Oct 2008, Gray Carper wrote:
>>
>> So, how concerned should we be about the low scores here and there? Any
>> suggestions on how to improve our configuration? And how excited should
>> we be about the 8GB tests? ;>
>
> The level of concern should depend on how you expect your storage pool
> to actually be used. It seems that it should work great for bulk storage,
> but not to support a database, or ultra high-performance super-computing
> applications. The good 8GB performance is due to successful ZFS ARC
> caching in RAM, and because the record size is reasonable given the ZFS
> block size and the buffering ability of the intermediate links. You might
> see somewhat better performance using a 256K record size.
>
> It may take quite a while to fill 150TB up.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
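On Solaris-derived builds of that era, the two tweaks Gray mentions would look roughly like the following. This is a hedged sketch: `tcp_naglim_def` is the documented Solaris ndd knob that effectively disables Nagle when set to 1, and `nxge0` is only an example interface name; every switch port and hop in the path must also be configured for jumbo frames.

```shell
# Effectively disable Nagle by shrinking the TCP coalescing limit to 1 byte.
ndd -set /dev/tcp tcp_naglim_def 1

# Jumbo frames: raise the MTU on the 10GbE link (nxge0 is an example name).
ifconfig nxge0 mtu 9000
```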
Gray Carper
2008-Oct-15 09:50 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy, Brent!

Thanks for your interest! We're pretty enthused about this project over here and I'd be happy to share some details with you (and anyone else who cares to peek). In this post I'll try to hit the major configuration bullet-points, but I can also throw you command-line level specifics if you want them.

1. The six Thumper iSCSI target nodes, and the iSCSI initiator head node, all have a high-availability network configuration by marrying link aggregation and IP multipathing techniques. Each machine has four 1GbE interfaces and one 10GbE interface (we could have had two 10GbE interfaces, but we decided to save some cash ;>). We link aggregate the four 1GbE interfaces together to create a fatter 4Gb pipe, then we use IP multipathing to group together the 10GbE interface and the 4Gb aggregation. Through this we create a virtual "service IP" which can float back and forth, automatically, between the two interfaces in the event of a network path failure. The preferred home is the 10GbE interface, but if that dies (or any part of its network path dies, like a switch somewhere down the line), then the service IP migrates to the 4Gb aggregate (which is on a completely separate network path) within four seconds. Whenever the 10GbE interface is happy again, the service IP automatically migrates back to its home.

2. The head node also has an Infiniband interface which plugs it into our HPC compute cluster network, giving the cluster direct access to whatever storage it needs.

3. All six iSCSI nodes have a redundant disk configuration using four ZFS raidz2 groups, each containing 10 drives which are spread across five controllers. Six additional disks, from a sixth controller, also live in the pool as spares. This results in a 28.4TB data pool for each node that can survive disk and controller failures.

4. Each of the six iSCSI nodes presents the entirety of its 28TB pool through CHAP-authenticated iSCSI targets. (See http://docs.sun.com/app/docs/doc/819-5461/gechv?a=view for more info on that process.)

5. The NAS head node has wrangled up all six of the iSCSI targets (using "iscsiadm add discovery-address ...") and joined them to create ~150TB of usable storage (using "zpool create" against the devices created with iscsiadm). With that, we've been able to carve up the storage into multiple ZFS filesystems, each with its own recordsize, quota, permissions, NFS/CIFS shares, etc.

I think that about covers the high-level stuff. If there's any area you want to dive deeper into, fire away!

-Gray

On Wed, Oct 15, 2008 at 1:29 AM, Brent Jones <brent at servuhome.net> wrote:
> On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <gcarper at umich.edu> wrote:
>> Hey, all!
>>
>> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
>> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool
>> on an x4200 head node. In trying to discover optimal ZFS pool
>> construction settings, we've run a number of iozone tests, so I thought
>> I'd share them with you and see if you have any comments, suggestions,
>> etc.
>>
>> [snip]
>
> Your setup sounds very interesting, particularly how you export iSCSI to
> another head unit. Can you give me some more details on your file system
> layout, and how you mount it on the head unit?
> Sounds like a pretty clever way to export awesomely large volumes!
>
> Regards,
>
> --
> Brent Jones
> brent at servuhome.net

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
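Put concretely, the initiator-side plumbing Gray describes in step 5 would look roughly like this. A sketch only: the IP addresses and device names are placeholders, and the CHAP setup from step 4 (covered by the docs.sun.com link above) is omitted.

```shell
# On the x4200 head node: point the initiator at each of the six Thumpers...
iscsiadm add discovery-address 192.168.1.101:3260   # repeat for the other five
iscsiadm modify discovery --sendtargets enable

# ...confirm the six LUNs are visible...
iscsiadm list target
format < /dev/null      # the iSCSI LUNs appear as new cXtYdZ devices

# ...then build the aggregate pool from them (placeholder device names)
zpool create data-iscsi c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0

# Carve it up into per-use filesystems, each with its own settings.
zfs create data-iscsi/homes
zfs set quota=10T data-iscsi/homes
zfs set sharenfs=on data-iscsi/homes
```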
Bob Friesenhahn
2008-Oct-15 14:33 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Gray Carper wrote:
> be good to set different recordsize parameters for each one. Do you have
> any suggestions on good starting sizes for each? I'd imagine filesharing
> might benefit from a relatively small record size (64K?), image-based
> backup targets might like a pretty large record size (256K?), databases
> just need recordsizes to match their block sizes, and HPC... I have no
> idea. Heh. I expect I'll need to get in contact with the HPC lab to see
> what kind of profile they have (whether they deal with tiny files or big
> files, etc). What do you think?

Pretty much the *only* reason to reduce the ZFS recordsize from its default of 128K is to support relatively unusual applications like databases which do random read/writes of small (often 8K) blocks. For sequential I/O, 128K is fine even if the application (or client) does reads/writes using much smaller blocks.

For small-block random I/O you will find that ZFS performance improves immensely when the ZFS recordsize matches the application recordsize. The reason for this is that ZFS does I/O using its full blocksize and so there is more latency and waste of I/O bandwidth and CPU if ZFS needs to process a 128K block for each 8K block update.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
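Bob's advice translates into per-filesystem settings on the head node's pool. A sketch (filesystem names and the 8K figure are illustrative; note that recordsize only affects files written after the property is changed):

```shell
# Leave general file sharing at the 128K default; tune only the database area
# to match the database's block size (8K is a common example).
zfs create data-iscsi/db
zfs set recordsize=8K data-iscsi/db

# The backup-image area can simply stay at the default large recordsize.
zfs create data-iscsi/backups

# Verify the settings.
zfs get recordsize data-iscsi/db data-iscsi/backups
```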
Archie Cowan
2008-Oct-15 15:01 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
I just stumbled upon this thread somehow and thought I'd share my ZFS-over-iSCSI experience.

We recently abandoned a similar configuration with several pairs of x4500s exporting zvols as iscsi targets and mirroring them for "high availability" with T5220s.

Initially, our performance was also good using iozone tests, but, in testing the resilvering processes with 10TB of data, it was abysmal. It took over a month for a 10TB x4500 mirror that was mostly mirrored to resilver back into health with its pair. So, not exactly a highly available configuration... if one x4500 had gone unhealthy while the other was still resilvering, we'd have been in a real bad place.

Also, "zfs send" operations on filesystems hosted by the iscsi zpool couldn't push out more than a few kilobytes per second. Yes, we had all the multipathing, vlans, memory buffering and all kinds of nonsense to keep the network from being the bottleneck, but to not much benefit. This was our plan for keeping our remote sites' filesystems in sync, so it was vital.

Maybe we did something completely wrong with our setup, but I'd suggest you verify how long it takes to resilver new x4500s into your iscsi pools, and also see how well it does when your zpools are almost full. Our initial good performance test results were too good to be true, and it turned out that they weren't the whole story.

Good luck.
--
This message posted from opensolaris.org
Richard Elling
2008-Oct-15 17:39 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Archie Cowan wrote:
> I just stumbled upon this thread somehow and thought I'd share my zfs
> over iscsi experience.
>
> We recently abandoned a similar configuration with several pairs of
> x4500s exporting zvols as iscsi targets and mirroring them for "high
> availability" with T5220s.

In general, such tasks would be better served by T5220 (or the new T5440 :-) and J4500s. This would change the data paths from:

    client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks

to

    client --<net>-- T5440 --<SAS>-- disks

With the J4500 you get the same storage density as the X4500, but with SAS access (some would call this direct access). You will have much better bandwidth and lower latency between the T5440 (server) and disks while still having the ability to multi-head the disks.

The J4500 is a relatively new system, so this option may not have been available at the time Archie was building his system.
-- richard

> Initially, our performance was also good using iozone tests, but, in
> testing the resilvering processes with 10tb of data, it was abysmal. It
> took over a month for a 10tb x4500 mirror that was mostly mirrored to
> resilver back into health with its pair. So, not exactly a highly
> available configuration... if the other x4500 went unhealthy while the
> other was still resilvering we'd have been in a real bad place.
>
> Also, "zfs send" operations on filesystems hosted by the iscsi zpool
> couldn't push out more than a few kilobytes per second. Yes, we had all
> the multipathing, vlans, memory buffering and all kinds of nonsense to
> keep the network from being the bottleneck but to not much benefit. This
> was our plan for keeping our remote sites' filesystems in sync so it was
> vital.
>
> Maybe we did something completely wrong with our setup, but I'd suggest
> you verify how long it takes to resilver new x4500s into your iscsi pools
> and also see how well it does when your zpools are almost full. Our
> initial good performance test results were too good to be true and it
> turned out that they weren't the whole story.
>
> Good luck.
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Akhilesh Mritunjai
2008-Oct-15 18:20 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hi Gray,

You've got a nice setup going there. A few comments:

1. Do not tune ZFS without a proven test case to show otherwise, except...
2. For databases. Tune the recordsize for that particular FS to match the DB recordsize.

A few questions...

* How are you divvying up the space?
* How are you taking care of redundancy?
* Are you aware that each layer of ZFS needs its own redundancy?

Since you have got a mixed use case here, I would be surprised if a single general config would cover all, though it might do with some luck.
Ross
2008-Oct-15 18:47 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Am I right in thinking your top-level zpool is a raid-z pool consisting of six 28TB iSCSI volumes? If so, that's a very nice setup; it's what we'd be doing if we had that kind of cash :-)
Miles Nordin
2008-Oct-15 18:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "gc" == Gray Carper <gcarper at umich.edu> writes:

    gc> 5. The NAS head node has wrangled up all six of the iSCSI
    gc> targets

are you using raidz on the head node? It sounds like simple striping, which is probably dangerous with the current code. This kind of sucks, because with simple striping you will get the performance of the 6 mega-spindles, while in a raidz you don't just get less storage, you get ~1/6th the seek bandwidth. but that's better than losing a whole pool. It's not even fully effective redundancy if resilvering/scrubbing takes 3 days per 1TB, but if it just stops the pool from becoming corrupt and unimportable then it's done its job.

how are you backing up that much storage? or is it all ephemeral? It's common to lose a whole pool, so I'd have thought you'd want to, for example, keep home directories on a main pool and a backup pool, but keep only one copy of the backup dumps, since in theory they have corresponding originals somewhere else.

If you did split your x4500 * 6 into two pools, I wonder how you'd lay out a ``main pool'' and ``backup pool'' such that they'd be unlikely to get corrupt together. make them on disjoint sets of iscsi target nodes? keep the backup pool exported? you could keep the backup pool imported so other groups can write their backups there, and spread it across all 6 targets, but declare a recurring noon - 3pm maintenance window for the backup pool, in which you: export, test-import, export, take snapshots of the zvols on the target nodes, import. Normally you would need II to use device snapshots for corruption protection, but since you have two layers of ZFS you can use this remedial trick without learning how to use AVS. but only while the pool is exported, because otherwise there's no way to take all six snapshots at the same instant. or you could just hope.

I'm most interested in failure testing. What happens when you reboot nodes or break network connections while there's write activity to the pool? That is nice that the ``service address'' fails over and fails back, and that you've somehow extended the heartbeat all the way from target to head node, but does this actually work well with iSCSI? Does the iSCSI TCP circuit get broken and remade when the address moves, and does this cause errors, or even cause corruption if it happens while writing to the pool? How about something more drastic, like rebooting the x4500s---does the head node patiently wait and then continue where it left off, like clients are supposed to when their NFS server reboots, or does it panic, or does it freeze for a couple minutes, mark the target down and continue, and then throw a bunch of CKSUM errors when the target comes back? The last one is what happens for me, but I have a mirror vdev on the head node, so my setup's different.

If you can get this setup to work sanely in error scenarios, I think it can potentially have an availability advantage, because some of the driver problems causing freezes and kernel panics and hung ports we've seen won't hurt you---you can just reboot the whole target node, so shitty drivers become merely irritating to the sysadmin instead of an availability problem. but my expectation is, you can't. It sounds really scary to me, to be honest, like: 200 eggs, one basket. and the basket is made of duct tape.

i'm less interested in performance. I can think of a bunch of silly performance-test questions, but I found most interesting Archie's experience about how performance can influence effective reliability. Here are the silly questions:

have you tried any other layouts? like exporting individual disks with iSCSI? My intuition is that this would not work well because of TCP congestion, and I also worry the iscsi target would freeze the whole box when one drive failed, a behavior which could be statistically significant to the overall system's reliability when there are so many drives involved. but I wonder.

also a simple SVM stripe, or maybe two or three stripes per box, might be faster by avoiding zvol COW. (also, know that Linux has an iSCSI target, too. actually it has three right now: IET, scst, and stgt)

any end-to-end testing yet? how is the performance of NFS or CIFS or...what are you hoping to use over the infiniband again, comstar/iSER or is it just IP+NFS? i don't know much about IB.

are there fast disks in the head node that you could use to experiment with slog or l2arc? since slogs can't be removed without destroying the pool, you might want testing of NFS+slog/NFS-slog before the pool has real data on it.

can you try with and without RED on the switches? i've always wondered if this makes a difference, but not bothered to check it because my targets are too slow.
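Miles' maintenance-window trick for the backup pool can be sketched roughly as follows. This is a hedged sketch, not a tested procedure: the pool name "backup", the hostnames thumper1..thumper6, and the per-target zvol name data/iscsivol are all hypothetical stand-ins for whatever the real setup uses.

```shell
# Sketch of the export / test-import / snapshot / import cycle described above.
# "backup" is the head-node pool; thumper1..6 are the iSCSI target nodes,
# each assumed to export a zvol named data/iscsivol.

zpool export backup    # quiesce: no writers, all six targets now consistent
zpool import backup    # test-import: verify the pool still imports cleanly
zpool export backup    # export again before snapshotting

# With the pool exported, the six zvols are mutually consistent, so
# per-target snapshots taken now form a usable point-in-time copy.
for t in thumper1 thumper2 thumper3 thumper4 thumper5 thumper6; do
    ssh "$t" zfs snapshot "data/iscsivol@maintenance-$(date +%Y%m%d)"
done

zpool import backup    # reopen the pool for backup writers
```

The export is what substitutes for Instant Image / AVS here: it guarantees the six snapshots describe the same instant, which six live snapshots could not.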
Miles Nordin
2008-Oct-15 19:04 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

    r> Am I right in thinking your top level zpool is a raid-z pool
    r> consisting of six 28TB iSCSI volumes? If so that's a very
    r> nice setup,

not if it scrubs at 400GB/day, and 'zfs send' is uselessly slow. Also, I am thinking the J4500 Richard mentioned may be more robust to single disk failures not taking down the whole box, compared to a device with a Solaris kernel in it.

s/very nice/stupidly large capacity/
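To put Miles' 400GB/day figure in context against the ~150TB pool under discussion, a quick back-of-the-envelope in shell arithmetic (the rate is Miles' number; the pool size is the original poster's):

```shell
# Full-scrub duration for a ~150TB pool at a 400GB/day scrub rate.
pool_tb=150
rate_gb_per_day=400
days=$(( pool_tb * 1024 / rate_gb_per_day ))
echo "full scrub: about $days days"   # -> full scrub: about 384 days
```

At that rate a single full scrub takes over a year, which is the scalability concern behind both Archie's and Miles' warnings.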
Marcelo Leal
2008-Oct-15 19:24 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Are you talking about what he had in the "logic of the configuration at top level", or are you saying his top-level pool is a raidz? I would think his top-level zpool is a raid0...
Bob Friesenhahn
2008-Oct-15 19:35 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Marcelo Leal wrote:
> Are you talking about what he had in the "logic of the configuration at top level", or you are saying his top level pool is a raidz?
> I would think his top level zpool is a raid0...

ZFS does not support RAID0 (simple striping).

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Tomas Ögren
2008-Oct-15 19:40 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On 15 October, 2008 - Bob Friesenhahn sent me these 0,6K bytes:

> On Wed, 15 Oct 2008, Marcelo Leal wrote:
>
> > Are you talking about what he had in the "logic of the configuration at top level", or you are saying his top level pool is a raidz?
> > I would think his top level zpool is a raid0...
>
> ZFS does not support RAID0 (simple striping).

zpool create mypool disk1 disk2 disk3

Sure it does.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Marcelo Leal
2008-Oct-15 20:08 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
So, there is no raid10 in a solaris/zfs setup? I'm talking about "no redundancy"...
Bob Friesenhahn
2008-Oct-15 20:45 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Tomas Ögren wrote:
>> ZFS does not support RAID0 (simple striping).
>
> zpool create mypool disk1 disk2 disk3
>
> Sure it does.

This is load-share, not RAID0. Also, to answer the other fellow: since ZFS does not support RAID0, it also does not support RAID 1+0 (10). :-)

With RAID0 and 8 drives in a stripe, if you send a 128K block of data, it gets split up into eight chunks, with a chunk written to each drive. With ZFS's load share, that 128K block of data only gets written to one of the eight drives, and no striping takes place. The next write is highly likely to go to a different drive. Load share seems somewhat similar to RAID0, but it is easy to see that it is not by looking at the drive LEDs on a drive array while writes are taking place.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
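For readers following Marcelo's RAID10 question: the layout ZFS administrators usually reach for in place of RAID 1+0 is a dynamic stripe of mirror vdevs. A minimal sketch, with hypothetical pool and disk names (and, per Bob's point, the top level is load-share allocation rather than a true fixed-width stripe):

```shell
# Dynamic "stripe" across two 2-way mirror vdevs -- the common ZFS
# stand-in for RAID 1+0. Each mirror survives one disk failure;
# writes are spread across the two mirrors by ZFS's allocator.
zpool create tank \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0

zpool status tank    # shows two top-level mirror vdevs
```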
Richard Elling
2008-Oct-15 22:57 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Bob Friesenhahn wrote:
> On Wed, 15 Oct 2008, Tomas Ögren wrote:
>>> ZFS does not support RAID0 (simple striping).
>>
>> zpool create mypool disk1 disk2 disk3
>>
>> Sure it does.
>
> This is load-share, not RAID0. Also, to answer the other fellow,
> since ZFS does not support RAID0, it also does not support RAID 1+0
> (10). :-)

Technically correct. But beware of operational definitions. From the SNIA Dictionary, http://www.snia.org/education/dictionary

RAID Level 0
    [Storage System] Synonym for data striping.

RAID Level 1
    [Storage System] Synonym for mirroring.

RAID Level 10
    Not defined at SNIA, but generally agreed to be data stripes of mirrors.

Data Striping
    [Storage System] A disk array data mapping technique in which fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern. Disk striping is commonly called RAID Level 0 or RAID 0 because of its similarity to common RAID data mapping techniques. It includes no redundancy, however, so strictly speaking, the appellation RAID is a misnomer.

mirroring
    [Storage System] A configuration of storage in which two or more identical copies of data are maintained on separate media; also known as RAID Level 1, disk shadowing, real-time copy, and t1 copy.

ZFS dynamic stripes are not restricted to fixed-length sequences, so they are not, technically, data stripes by the SNIA definition. ZFS mirrors do fit the SNIA definition of mirroring, though ZFS does so by logical address, not a physical block offset. You will often see people describe ZFS mirroring with multiple top-level vdevs as RAID-1+0 (or 10), because that is a well-known thing. But if you see this in any of the official documentation, then please file a bug.

> With RAID0 and 8 drives in a stripe, if you send a 128K block of data,
> it gets split up into eight chunks, with a chunk written to each
> drive. With ZFS's load share, that 128K block of data only gets
> written to one of the eight drives and no striping takes place. The
> next write is highly likely to go to a different drive. Load share
> seems somewhat similar to RAID0 but it is easy to see that it is not
> by looking at the drive LEDs on a drive array while writes are taking
> place.

ZFS allocates data in slabs. By default, the slabs are 1 MByte each. So a vdev is divided into a collection of slabs, and when ZFS fills a slab, it moves on to another. With a dynamic stripe, the next slab may be on a different vdev, depending on how much free space is available. So you may see many I/Os hitting one disk, just because they happen to be allocated on the same vdev, perhaps in the same slab.
 -- richard
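An easy way to observe the allocation behavior Bob and Richard describe on a live system, without watching drive LEDs, is per-vdev I/O statistics (the pool name here is hypothetical):

```shell
# Report per-vdev I/O every 5 seconds. With dynamic striping you will
# often see write bursts concentrate on one top-level vdev at a time
# (slab-by-slab allocation) rather than the perfectly even split a
# fixed-width RAID0 stripe would show.
zpool iostat -v tank 5
```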
Ross
2008-Oct-16 07:45 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Well, obviously recovery scenarios need testing, but I still don't see it being that bad. My thinking on this is:

1. Loss of a server is very much the worst case scenario. Disk errors are much more likely, and with raid-z2 pools on the individual servers these should not pose a problem. I also would not expect to see disk failures downing an entire x4500. Sun have sold an awful lot of these now, enough for me to feel any such problems should be a thing of the past.

2. Even when a server does fail, the nature of ZFS is such that you would not expect to lose your data, nor should you be expecting to resilver the entire 28TB. A motherboard / backplane / PSU failure will offline that server, but once the faulted components are replaced your pool will come back online. Once the pool is online, ZFS has the ability to resilver just the changed data, meaning that your rebuild time will be simply proportional to the time the server was down.

Of course these failure modes would need testing, as would rebuild times. I don't see 'zfs send' performance being an issue though, not unless Gray has another 150TB of storage lying around that he's not telling us about. :-)

There are always going to be some tradeoffs between risk, capacity and price, but I expect that the benefits of this setup far outweigh the negatives.

Ross
Gray Carper
2008-Oct-16 07:50 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy!

Very valuable advice here (and from Bob, who made similar comments - thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In the case of databases, we'll stray as appropriate, and we may also stray with the HPC compute cluster if we can demonstrate that it is worth it.

To answer your questions below...

Currently, we have a single pool, in a "load share" configuration (no raidz), that collects all the storage (which answers Ross' question too). From that we carve filesystems on demand. There are many more tests planned for that construction, though, so we are not married to it.

Redundancy abounds. ;> Since the pool doesn't employ raidz, it isn't internally redundant, but we plan to replicate the pool's data to an identical system (which is not yet built) at another site. Our initial userbase doesn't need the replication, however, because they use the system for little more than scratch space. Huge genomic datasets are dumped on the storage, analyzed, and the results (which are much smaller) get sent elsewhere. Everything is wiped out soon after that and the process starts again. Future projected uses of the storage, however, would be far less tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

I see that Archie and Miles have shared some harrowing concerns, which we take very seriously. I don't think I'll be able to reply to them today, but I certainly will in the near future (particularly once we've completed some more of our induced failure scenarios).

Sidenote: Today we made eight network/iSCSI related tweaks that, in aggregate, have resulted in dramatic performance improvements (some I just hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

- disabling the Nagle algorithm on the head node
- setting each iSCSI target block size to match the ZFS record size of 128K
- disabling "thin provisioning" on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16

Rerunning the same tests, we now see...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 143373
Rewrite: 183170
Read: 433205
Reread: 435503
Random Read: 90118
Random Write: 19488

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 463260
Rewrite: 449280
Read: 1092291
Reread: 881044
Random Read: 442565
Random Write: 565565

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 357199
Rewrite: 342788
Read: 609553
Reread: 645618
Random Read: 218874
Random Write: 339624

Thanks so much to everyone for all their great contributions!
-Gray

On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai <mritun+opensolaris at gmail.com> wrote:

> Hi Gray,
>
> You've got a nice setup going there, few comments:
>
> 1. Do not tune ZFS without a proven test-case to show otherwise, except...
> 2. For databases. Tune recordsize for that particular FS to match DB recordsize.
>
> Few questions...
>
> * How are you divvying up the space ?
> * How are you taking care of redundancy ?
> * Are you aware that each layer of ZFS needs its own redundancy ?
>
> Since you have got a mixed use case here, I would be surprized if a general config would cover all, though it might do with some luck.

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
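Gray's list of tunables can be sketched concretely for readers who want to try the same thing. Treat this as an assumption-laden sketch, not Gray's actual procedure: the values are his, but whether each tunable belongs in ndd or /etc/system should be verified against the Solaris release in use, and the interface name nxge0 is hypothetical.

```shell
# Hedged sketch of the head-node tweaks (Solaris 10 / OpenSolaris era).

# Runtime TCP tunables via ndd (not persistent across reboots):
ndd -set /dev/tcp tcp_naglim_def 1          # 1 effectively disables Nagle
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16

# Persistent kernel settings in /etc/system (take effect at next boot):
cat >> /etc/system <<'EOF'
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16
EOF

# Jumbo frames on the 10GbE interface (interface name hypothetical);
# every switch in the path must be configured for jumbo frames as well:
ifconfig nxge0 mtu 9000
```

The iSCSI-side changes (128K block size, thin provisioning off) are made in the target configuration on each Thumper rather than on the head node.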
Ross
2008-Oct-16 07:58 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Miles makes a good point here: you really need to look at how this copes with various failure modes. Based on my experience, iSCSI is something that may cause you problems. When I tested this kind of setup last year, I found that the entire pool hung for 3 minutes any time an iSCSI volume went offline. It looked like a relatively simple thing to fix if you can recompile the iSCSI driver, and there is talk about making the timeout adjustable, but for me that was enough to put our project on hold for now.

Ross
Gray Carper
2008-Oct-16 08:00 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Oops - one thing I meant to mention: We only plan to cross-site replicate data for those folks who require it. The HPC data crunching would have no use for it, so that filesystem wouldn't be replicated. In reality, we only expect a select few users, with relatively small filesystems, to actually need replication. (Which raises the question: Why build an identical 150TB system to support that? Good question. I think we'll reevaluate. ;>)

-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper <gcarper at umich.edu> wrote:

> Howdy!
>
> Very valuable advice here (and from Bob, who made similar comments -
> thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In
> the case of databases, we'll stray as appropriate, and we may also stray
> with the HPC compute cluster if we can demonstrate that it is worth it.
>
> [...]

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
Miles Nordin
2008-Oct-16 19:01 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

    r> 1. Loss of a server is very much the worst case scenario.
    r> Disk errors are much more likely, and with raid-z2 pools on
    r> the individual servers

yeah, it kind of sucks that the slow resilvering speed enforces this two-tier scheme. Also, if you're going to have 1000 spinning platters you'll have a drive failure every four days or so---you need to be able to do more than one resilver at a time, and you need to do resilvers without interrupting scrubs, which could take so long to run that you run them continuously. The ZFS-on-zvol hack lets you do both to a point, but I think it's an ugly workaround for lack of scalability in flat ZFS, not the ideal way to do things.

    r> A motherboard / backplane / PSU failure will offline that
    r> server, but once the faulted components are replaced your pool
    r> will come back online. Once the pool is online, ZFS has the
    r> ability to resilver just the changed data,

except that is not what actually happens for my iSCSI setup. If I 'zpool offline' the target before taking it down, it usually does work as you describe---a relatively fast resilver kicks off, and no CKSUM errors appear later. I've used it gently. I haven't offlined a raidz2 device for three weeks while writing gigabytes to the pool in the mean time, but for my gentle use it does seem to work.

But if the iSCSI target goes down unexpectedly---ex., because I pull the network cord---it does come back online and does resilver, but latent CKSUM errors show up weeks later. Also, if the head node reboots during a resilver, ZFS totally forgets what it was doing, and upon reboot just blindly mounts the unclean component as if it were clean, later calling all the differences CKSUM errors. the same thing happens if you offline a device, then reboot. The ``persistent'' offlining doesn't seem to work, and in any case the device comes online without a proper resilver.

SVM had dirty-region logging stored in the metadb so that resilvers could continue where they left off across reboots. I believe SVM usually did a full resilver when a component disappeared, but am not sure this was always the case. Anyway, ZFS doesn't seem to have a similar capability, at least not one that works.

so, in practice, whenever any iSCSI component goes away unexpectedly---target server failure, power failure, kernel panic, L2 spanning tree reconfiguration, whatever---you have to scrub the whole pool from the head node.

It's interesting how the speed and optimisation of these maintenance activities limit pool size. It's not just full scrubs. If the filesystem is subject to corruption, you need a backup. If the filesystem takes two months to back up / restore, then you need really solid incremental backup/restore features, and the backup needs to be a cold spare, not just a backup---restoring means switching the roles of the primary and backup system, not actually moving data.

finally, for really big pools, even O(n) might be too slow. The ZFS best practice guide for converting UFS to ZFS says ``start multiple rsync's in parallel,'' but I think we're finding zpool scrubs and zfs sends are not well-parallelized.

These reliability limitations and performance characteristics of maintenance tasks seem to make a sort of max-pool-size Wall, beyond which you end up painted into corners. If they were made better, I think you'd later hit another wall at the maximum amount of data you could push through one head node, and would have to switch to some QFS/GFS/OCFS-type separate-data-and-metadata filesystem, and to match ZFS this filesystem would have to do scrubs, resilvers, and backups in a distributed way, not just distribute normal data access.

A month ago I might have ranted, ``head node speed puts a cap on how _busy_ the filesystem can be, not how big it can be, so ZFS (modulo a lot of bug fixes) could be fantastic for data sets of virtually unlimited size even with its single-initiator, single-head-node limitation, so long as the pool gets very light access.'' Now, I don't think so, because scrubbing/resilvering/backup-restore has to flow through the head node, too.

This observation also means my preference for a ``recovery tool'' that treats corrupt pools as read-only, over fsck (online or offline), isn't very scalable. The original zfs kool-aid ``online maintenance'' model of doing a cheap fsck at import time and a long O(n) fsck through online scrubs is the only one with a future in a world where maintenance activities can take months.
Marion Hakanson
2008-Oct-16 19:20 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
carton at Ivy.NET said:
> It's interesting how the speed and optimisation of these maintenance
> activities limit pool size. It's not just full scrubs. If the filesystem is
> subject to corruption, you need a backup. If the filesystem takes two months
> to back up / restore, then you need really solid incremental backup/restore
> features, and the backup needs to be a cold spare, not just a
> backup---restoring means switching the roles of the primary and backup
> system, not actually moving data.

I'll chime in here with feeling uncomfortable with such a huge ZFS pool, and also with my discomfort with the ZFS-over-iSCSI-on-ZFS approach. There just seem to be too many moving parts depending on each other, any one of which can make the entire pool unavailable.

For the stated usage of the original poster, I think I would aim toward turning each of the Thumpers into an NFS server, configuring the head node as a pNFS/NFSv4.1 metadata server, and letting all the clients speak parallel NFS to the "cluster" of file servers. You'll end up with a huge logical pool, but a Thumper outage should result only in loss of access to the data on that particular system. The work of scrub/resilver/replication can be divided among the servers rather than all living on a single head node.

Regards,

Marion
Erast Benson
2008-Oct-16 19:43 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
pNFS is NFS-centric, of course, and it is not yet stable, is it? btw, what is the ETA for pNFS putback?

On Thu, 2008-10-16 at 12:20 -0700, Marion Hakanson wrote:
> I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
> and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach. There
> just seem to be too many moving parts depending on each other, any one of
> which can make the entire pool unavailable.
>
> For the stated usage of the original poster, I think I would aim toward
> turning each of the Thumpers into an NFS server, configure the head-node
> as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
> to the "cluster" of file servers. You'll end up with a huge logical pool,
> but a Thumper outage should result only in loss of access to the data on
> that particular system. The work of scrub/resilver/replication can be
> divided among the servers rather than all living on a single head node.
>
> Regards,
>
> Marion
Nicolas Williams
2008-Oct-16 19:53 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 12:20:36PM -0700, Marion Hakanson wrote:
> I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
> and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach. There
> just seem to be too many moving parts depending on each other, any one of
> which can make the entire pool unavailable.

But does it work well enough? It may be faster than NFS if there's only one
client for each volume (unless you have fast slog devices for the ZIL). And
it'd have better semantics too (e.g., no need for the client and server to
agree on identities/domains).

Nico
--
Marion Hakanson
2008-Oct-16 19:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Richard.Elling at Sun.COM said:
> In general, such tasks would be better served by T5220 (or the new T5440 :-)
> and J4500s. This would change the data paths from:
>   client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks
> to
>   client --<net>-- T5440 --<SAS>-- disks
>
> With the J4500 you get the same storage density as the X4500, but with SAS
> access (some would call this direct access). You will have much better
> bandwidth and lower latency between the T5440 (server) and disks while still
> having the ability to multi-head the disks.

There's an odd economic factor here, if you're in the .edu sector: The Sun
Education Essentials promotional price list has the X4540 priced lower than
a bare J4500 (not on the promotional list, but with a standard EDU
discount). We have a project under development right now which might be
served well by one of these EDU X4540's with a J4400 attached to it.

The spec sheets for the J4400 and J4500 say you can chain together enough
of them to make a pool of 192 drives. I'm unsure about the bandwidth of
these daisy-chained SAS interconnects, though. Any thoughts as to how high
one might scale an X4540-plus-J4x00 solution? How does the X4540's internal
disk bandwidth compare to that of the (non-RAID) SAS HBA?

Regards,

Marion
Miles Nordin
2008-Oct-16 20:30 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> But does it work well enough? It may be faster than NFS if

You're talking about different things. Gray is using NFS period between
the storage cluster and the compute cluster, no iSCSI.

Gray's (``does it work well enough''):
  iSCSI within storage cluster
  NFS to storage consumers

Marion's (less ``uncomfortable''):
  nothing(?) within storage cluster
  pNFS to storage consumers

but Marion's is not really possible at all, and won't be for a while with
other groups' choice of storage-consumer platform, so it'd have to be
GlusterFS or some other goofy fringe FUSEy thing or a not-very-general
crude in-house hack.

I guess since Gray is copying data in and out all the time he doesn't have
to worry about the glacial-restore problem and the corruption problem.

If it were my worry, I'd definitely include NFS clients in the performance
test because iSCSI is high-latency, and the NFS clients could be more
latency-sensitive than the local benchmark. I might test coalescing in the
big data separately from running the crunching, because maybe the big data
can be copied in with pax-over-netcat, or something other than NFS, and
maybe the crunching could treat the big data as read-only and write its
small result to a fast standalone ZFS server, which would make NFS faster.

And I'd get the small important data that needs backup off this mess (but
please let us know how the failure-simulation testing goes!).
Nicolas Williams
2008-Oct-16 20:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 04:30:28PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>
>     nw> But does it work well enough? It may be faster than NFS if
>
> You're talking about different things. Gray is using NFS period
> between the storage cluster and the compute cluster, no iSCSI.

I was replying to Marion's comment about "ZFS-over-ISCSI-on-ZFS," not to
Gray.

I can see why one might worry about ZFS-over-iSCSI-on-ZFS. Two layers of
copy-on-write might interact in odd ways that kill performance. But if you
want ZFS-over-iSCSI in the first place then ZFS-over-iSCSI-on-ZFS sounds
like the correct approach IF it can perform well enough.

ZFS-over-iSCSI could certainly perform better than NFS, but again, it may
depend on what kind of ZIL devices you have.

Nico
--
Miles Nordin
2008-Oct-16 21:11 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>>>>> "mh" == Marion Hakanson <hakansom at ohsu.edu> writes:

    nw> I was replying to Marion's [...]
    nw> ZFS-over-iSCSI could certainly perform better than NFS,

better than what, ZFS-over-'mkfile'-files-on-NFS? No one was suggesting
that. Do you mean better than pNFS?

It sounded at first like you meant iSCSI-over-ZFS should perform better
than NFS, but no one's suggesting that either.

  Gray:   NFS over ZFS over iSCSI over ZFS over disk
  Marion: pNFS over ZFS over disk

They are both using the same amount of {,p}NFS.
David Magda
2008-Oct-16 22:29 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Oct 16, 2008, at 15:20, Marion Hakanson wrote:
> For the stated usage of the original poster, I think I would aim toward
> turning each of the Thumpers into an NFS server, configure the head-node
> as a pNFS/NFSv4.1

It's a shame that Lustre isn't available on Solaris yet either.
Marion Hakanson
2008-Oct-16 23:46 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
carton at Ivy.NET said:
> but Marion's is not really possible at all, and won't be for a while with
> other groups' choice of storage-consumer platform, so it'd have to be
> GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude
> in-house hack.

Well, of course the magnitude of the fringe factor is in the eye of the
beholder. I didn't intend to make pNFS seem like a done deal. I don't
quite yet think of OpenSolaris as a "done deal" either, still using
Solaris-10 here in production, but since this is an OpenSolaris mailing
list I should be more careful.

Anyway, from looking over the wiki/blog info, apparently the sticking
point with pNFS may be client-side availability -- there are only Linux
and (Open)Solaris NFSv4.1 clients just yet. Still, pNFS claims to be
backwards compatible with NFSv3 clients: If you point a traditional NFS
client at the pNFS metadata server, the MDS is supposed to relay the data
from the backend data servers.

dmagda at ee.ryerson.ca said:
> It's a shame that Lustre isn't available on Solaris yet either.

Actually, that may not be so terribly fringey, either. Lustre and Sun's
Scalable Storage product can make use of Thumpers:
  http://www.sun.com/software/products/lustre/
  http://www.sun.com/servers/cr/scalablestorage/

Apparently it's possible to have a Solaris/ZFS data-server for Lustre
backend storage:
  http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU

I see they do not yet have anything other than Linux clients, so that's a
limitation. But you can share out a Lustre filesystem over NFS, potentially
from multiple Lustre clients. Maybe via CIFS/samba as well.

Lastly, I've considered the idea of using Shared-QFS to glue together
multiple Thumper-hosted iSCSI LUN's. You could add shared-QFS clients
(acting as NFS/CIFS servers) if the client load needed more than one. Then
SAM-FS would be a possibility for backup/replication.

Anyway, I do feel that none of this stuff is quite "there" yet. But my
experience with ZFS on fibre-channel SAN storage, that sinking feeling I've
had when a little connectivity glitch resulted in a ZFS panic, makes me
wonder if non-redundant ZFS on an iSCSI SAN is "there" yet, either. So far
none of our lost-connection incidents resulted in pool corruption, but we
have only 4TB or so. Restoring that much from tape is feasible, but even if
Gray's 150TB of data can be recreated, it would take weeks to reload it.

If it's decided that the clustered-filesystem solutions aren't feasible
yet, the suggestion I've seen that I liked the best was Richard's, with a
bad-boy server SAS-connected to multiple J4500's. But since Gray's project
already has the X4500's, I guess they'd have to find another use for them
(:-).

Regards,

Marion
Ross
2008-Oct-17 08:31 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Some of that is very worrying, Miles, do you have bug IDs for any of those
problems?

I'm guessing the problem of the device being reported ok after the reboot
could be this one:
http://bugs.opensolaris.org/view_bug.do?bug_id=6582549

And could the errors after the reboot be one of these?
http://bugs.opensolaris.org/view_bug.do?bug_id=6558852
http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

I don't have the same concerns myself that you guys have over massive
pools since we're working at a much smaller scale, but even so it's no
good ZFS having one of its main selling features as "only resilvers the
missing data" if it can't be relied upon to do that every time in
real-world situations.

Incidentally, even with those resilver bugs, a few back-of-the-envelope
calculations make me think that this might not be too bad with 10Gb
ethernet:

  Server size: 28TB
  Interconnect speed: 10Gb/s (call it 8Gb/s of actual bandwidth)
  Usage: 70% (worst-case scenario - pool dies while under heavy load)

That gives us an available resilver bandwidth of 3Gb/s, which I'll divide
by two since that has to be used for both reads and writes. 28TB at
1.5Gb/s gives a resilver time of around 42 hours, and changing some of the
assumptions by dropping pool usage to 20% brings that down to 16 hours.

It's still a long time, but for a rare disaster-recovery scenario for a
large pool, I think I could live with it.
-- 
This message posted from opensolaris.org
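Ross's envelope arithmetic above can be sketched as a small shell helper
(a hypothetical function, not from the thread; note his 3Gb/s figure
appears to take the idle fraction of the raw 10Gb/s link, so the sketch
does the same):

```shell
# Rough resilver-time estimate, mirroring the envelope math above.
# args: pool_tb link_gbps busy_pct
resilver_hours() {
    awk -v tb="$1" -v gbps="$2" -v busy="$3" 'BEGIN {
        # bandwidth left for resilvering, halved because the same
        # link carries both the reads and the writes
        avail = gbps * (1 - busy/100) / 2
        print tb * 1000 * 8 / avail / 3600   # TB -> Gbit, sec -> hours
    }'
}

resilver_hours 28 10 70   # heavy load: ~41.5 hours
resilver_hours 28 10 20   # light load: ~15.6 hours
```

This reproduces the ~42-hour and ~16-hour figures in the post; as Miles
notes downthread, real resilvers rarely run anywhere near wire speed, so
treat these as lower bounds.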
Miles Nordin
2008-Oct-17 18:35 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

     r> do you have bug IDs for any of those problems?

yeah, some of them, so maybe they will be fixed in s10u6. Sometimes the
bug report writer has a narrower idea of the problem than I do, but
bugs.opensolaris.org is still encouraging. Also note that there are secret
bugs, usually a bad thing, but you could pervert it into reason for even
more optimism.

  6592835 6602697 6722540 6698575 6675685

     r> Server size: 28TB Interconnect speed: 10Gb/s (call it 8Gb/s of
     r> actual bandwidth) Usage: 70%

     r> That gives us an available resilver bandwidth of 3Gb/s, which
     r> I'll divide by two since that has to be used for both reads
     r> and writes.

well...10Gbit/s Ethernet is full duplex, but none of that matters. I
don't think people are reporting resilvers at anywhere near wire speed.
Gary Mills
2008-Oct-20 18:28 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 03:50:19PM +0800, Gray Carper wrote:
> Sidenote: Today we made eight network/iSCSI-related tweaks that, in
> aggregate, have resulted in dramatic performance improvements (some I
> just hadn't gotten around to yet, others suggested by Sun's Mertol
> Ozyoney)...
> - disabling the Nagle algorithm on the head node
> - setting each iSCSI target block size to match the ZFS record size of 128K
> - disabling "thin provisioning" on the iSCSI targets
> - enabling jumbo frames everywhere (each switch and NIC)
> - raising ddi_msix_alloc_limit to 8
> - raising ip_soft_rings_cnt to 16
> - raising tcp_deferred_acks_max to 16
> - raising tcp_local_dacks_max to 16

Can you tell us which of those changes made the most dramatic improvement?
I have a similar situation here, with a 2-TB ZFS pool on a T2000 using
iSCSI to a NetApp file server. Is there any way to tell in advance if any
of those changes will make a difference? Many of them seem to be server
resources. How can I determine their current usage?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Jim Dunham
2008-Oct-20 20:21 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gary,

>> Sidenote: Today we made eight network/iSCSI-related tweaks that, in
>> aggregate, have resulted in dramatic performance improvements (some I
>> just hadn't gotten around to yet, others suggested by Sun's Mertol
>> Ozyoney)...
>> - disabling the Nagle algorithm on the head node
>> - setting each iSCSI target block size to match the ZFS record size of 128K
>> - disabling "thin provisioning" on the iSCSI targets
>> - enabling jumbo frames everywhere (each switch and NIC)
>> - raising ddi_msix_alloc_limit to 8
>> - raising ip_soft_rings_cnt to 16
>> - raising tcp_deferred_acks_max to 16
>> - raising tcp_local_dacks_max to 16
>
> Can you tell us which of those changes made the most dramatic
> improvement?

>> - disabling the Nagle algorithm on the head node

This will have a dramatic effect on most I/Os, except for large sequential
writes.

>> - setting each iSCSI target block size to match the ZFS record size of 128K
>> - enabling jumbo frames everywhere (each switch and NIC)

These will have a positive effect for large writes, both sequential and
random.

>> - disabling "thin provisioning" on the iSCSI targets

This only has a benefit for file-based or dsk-based backing stores. If one
uses rdsk backing stores of any type, this is not an issue.

Jim

> I have a similar situation here, with a 2-TB ZFS pool on a T2000 using
> iSCSI to a NetApp file server. Is there any way to tell in advance if any
> of those changes will make a difference? Many of them seem to be server
> resources. How can I determine their current usage?

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
Gray Carper
2008-Oct-21 00:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey, Jim! Thanks so much for the excellent assist on this - much better
than I could have ever answered it! I thought I'd add a little bit on the
other four...

- raising ddi_msix_alloc_limit to 8
For PCI cards that use up to 8 interrupts, which our 10GbE adapters do.
The previous value of 2 could cause some CPU interrupt bottlenecks. So
far, this has been more of a preventative measure - we haven't seen a case
where this really made any performance impact.

- raising ip_soft_rings_cnt to 16
This increases the number of kernel threads associated with packet
processing and is specifically meant to reduce the latency in handling
10GbE. This showed a small performance improvement.

- raising tcp_deferred_acks_max to 16
This reduces the number of ACK packets sent, thus reducing the overall TCP
overhead. This showed a small performance improvement.

- raising tcp_local_dacks_max to 16
This also slows down ACK packets and showed a tiny performance improvement.

Overall, we have found these four settings to not make a whole lot of
difference, but every little bit helps. ;> The four that Jim went through
were much more impactful, particularly the enabling of jumbo frames and
the disabling of the Nagle algorithm.

-Gray

On Tue, Oct 21, 2008 at 4:21 AM, Jim Dunham <James.Dunham at sun.com> wrote:
> [...]

-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
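For reference, the tunables discussed in this exchange are typically
applied on Solaris 10/OpenSolaris roughly as sketched below. The parameter
names are taken from the thread, but the exact syntax and defaults vary by
release, so verify against your system's documentation before applying:

```shell
# /etc/system entries (take effect at next reboot)
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16

# TCP tunables via ndd (run as root; these do not survive a reboot,
# so they usually also go into a boot-time script)
ndd -set /dev/tcp tcp_naglim_def 1          # effectively disables Nagle
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16

# Current values can be inspected the same way, e.g.:
ndd -get /dev/tcp tcp_deferred_acks_max
```

This also answers Gary Mills's question upthread about checking current
usage: `ndd -get` reports the present value of each TCP/IP tunable.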
Robert Milkowski
2008-Oct-22 17:18 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello Richard,

Wednesday, October 15, 2008, 6:39:49 PM, you wrote:

RE> Archie Cowan wrote:
>> I just stumbled upon this thread somehow and thought I'd share my zfs
>> over iscsi experience.
>>
>> We recently abandoned a similar configuration with several pairs of
>> x4500s exporting zvols as iscsi targets and mirroring them for "high
>> availability" with T5220s.

RE> In general, such tasks would be better served by T5220 (or the new T5440 :-)
RE> and J4500s. This would change the data paths from:
RE>   client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks
RE> to
RE>   client --<net>-- T5440 --<SAS>-- disks

RE> With the J4500 you get the same storage density as the X4500, but
RE> with SAS access (some would call this direct access). You will have
RE> much better bandwidth and lower latency between the T5440 (server)
RE> and disks while still having the ability to multi-head the disks. The
RE> J4500 is a relatively new system, so this option may not have been
RE> available at the time Archie was building his system.

Has MPxIO for the J4500 (SAS) been backported to S10 yet?

-- 
Best regards,
Robert Milkowski                        mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
Robert Milkowski
2008-Oct-22 17:26 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello Bob,

Wednesday, October 15, 2008, 9:45:52 PM, you wrote:

BF> On Wed, 15 Oct 2008, Tomas Ögren wrote:
>>> ZFS does not support RAID0 (simple striping).
>>
>> zpool create mypool disk1 disk2 disk3
>>
>> Sure it does.

BF> This is load-share, not RAID0. Also, to answer the other fellow,
BF> since ZFS does not support RAID0, it also does not support RAID 1+0
BF> (10). :-)

BF> With RAID0 and 8 drives in a stripe, if you send a 128K block of data,
BF> it gets split up into eight chunks, with a chunk written to each
BF> drive. With ZFS's load share, that 128K block of data only gets

Well, it depends on your stripe width - that would generally be true only
if your stripe width were 16KB. If you set up a 128KB stripe width, you
would end up with one or two I/Os to one or two disk drives, depending on
whether your write was stripe-aligned or not. ZFS will make sure that
every fs block is stripe-aligned when doing a RAID-0-like configuration
(aka ZFS dynamic striping). However, that's not true for raidz{1|2}.

-- 
Best regards,
Robert Milkowski                        mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
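The arithmetic behind Robert's point can be sketched with a small shell
helper (hypothetical, not from the thread) counting per-disk I/Os for a
single write on a classic RAID0 stripe:

```shell
# I/Os generated by one write on a classic RAID0 stripe.
# args: write_kb stripe_kb aligned(1|0)
ios_per_write() {
    ios=$(( ($1 + $2 - 1) / $2 ))        # ceiling division
    # an unaligned write straddles one extra stripe-unit boundary
    [ "$3" -eq 0 ] && ios=$((ios + 1))
    echo "$ios"
}

ios_per_write 128 16 1    # -> 8: all eight drives, as Bob described
ios_per_write 128 128 1   # -> 1: one drive with a 128KB stripe width
ios_per_write 128 128 0   # -> 2: two drives if the write is unaligned
```

ZFS's dynamic striping avoids the unaligned case entirely by keeping each
filesystem block on one top-level vdev, which is Robert's point above.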
Richard Elling
2008-Oct-22 19:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Robert Milkowski wrote:
> [...]
> Has MPxIO for the J4500 (SAS) been backported to S10 yet?

It is not a J4500 feature, it will depend on the HBA and driver. mpt(7d)
has it in Solaris 10 5/08 (update 5) and patches are available for update
4. When in doubt, check the man page for your driver.
 -- richard