We are having a really tough time accepting the performance of the ZFS and NFS interaction. I have tried many different ways to make it work (even zfs set:zil_disable 1) and I'm still nowhere near the performance of a standard NFS-mounted UFS filesystem - it is insanely slow, especially on file rewrites.

We have been combing the message boards, and it looks like there was a lot of talk about this zfs+nfs interaction back in November and before, but since then I have not seen much. It seems the only fix up to that date was to disable the ZIL - is that still the case? Did anyone ever get closure on this?

We are running Solaris 10 (SPARC), latest patched 11/06 release, connecting directly via FC to a 6120 with 2 RAID-5 volumes, serving NFS over a bge (gigabit) interface. I tried raidz, mirror and stripe with no appreciable difference in speed. The clients connecting to this machine are HP-UX 11i and OS X 10.4.9, and they both show the same performance characteristics.

Any insight would be appreciated - we really like ZFS compared to any filesystem we have EVER worked on and don't want to revert if at all possible!

TIA,

Andy Lubel
When you say rewrites, can you give more detail? For example, are you rewriting in 8K chunks, random sizes, etc.? The reason I ask is that ZFS will, by default, use 128K blocks for large files. If you then rewrite a small chunk at a time, ZFS is forced to read 128K, modify the small chunk you're changing, and then write 128K back out. Obviously, this has adverse effects on performance. :)

If your typical workload has a preferred block size that it uses, you might try setting the recordsize property in ZFS to match - that should help.

If you're completely rewriting the file, then I can't imagine why it would be slow. The only thing I can think of is the forced sync that NFS does on file close. But if you set zil_disable in /etc/system and reboot, you shouldn't see poor performance in that case.

Other folks have had good success with NFS/ZFS performance (while others have not). If it's possible, could you characterize your workload in a bit more detail?

--Bill

On Fri, Apr 20, 2007 at 04:07:44PM -0400, Andy Lubel wrote:
> We are having a really tough time accepting the performance with ZFS
> and NFS interaction. I have tried so many different ways trying to
> make it work (even zfs set:zil_disable 1) and I'm still nowhere near
> the performance of using a standard NFS mounted UFS filesystem [...]
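As a minimal sketch of the two suggestions above (assuming an 8K application/NFS write size, and borrowing the pool/dataset name that appears later in the thread - both are illustrative):

    # match the ZFS recordsize to the workload's write size
    # (affects newly created files only)
    zfs set recordsize=8k se6120/rfs-v10
    zfs get recordsize se6120/rfs-v10

    # for testing only, the ZIL tunable goes in /etc/system (reboot needed);
    # it trades NFS sync semantics for speed and is not a production fix
    set zfs:zil_disable = 1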
andy.Lubel at gtsi.com said:
> We have been combing the message boards and it looks like there was a lot of
> talk about this interaction of zfs+nfs back in november and before but since
> i have not seen much. It seems the only fix up to that date was to disable
> zil, is that still the case? Did anyone ever get closure on this?

There's a way to tell your 6120 to ignore ZFS cache flushes, until ZFS learns to do that itself. See:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-December/024194.html

Regards,

Marion
Marion Hakanson wrote:
> There's a way to tell your 6120 to ignore ZFS cache flushes, until ZFS
> learns to do that itself. See:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2006-December/024194.html

The 6120 isn't the same as a 6130/6140/6540. The instructions referenced above won't work on a T3/T3+/6120/6320.
Yeah, I saw that post about the other arrays, but none for this EOL'd hunk of metal. I have some 6130s, but hopefully by the time they are implemented we will have retired this NFS stuff and stepped into zvol iSCSI targets.

Thanks anyway.. back to the drawing board on how to resolve this!

-Andy

Torrey McMahon wrote:
> The 6120 isn't the same as a 6130/6140/6540. The instructions
> referenced above won't work on a T3/T3+/6120/6320
tmcmahon2 at yahoo.com said:
> The 6120 isn't the same as a 6130/6140/6540. The instructions referenced
> above won't work on a T3/T3+/6120/6320

Sigh. I can't keep up (:-). Thanks for the correction.

Marion
I'm not sure about the workload, but I did configure the volumes with the block size in mind.. it didn't seem to do much. It could be because I'm layering ZFS raid on top of HW raid, and I just don't know the equation to define a smarter blocksize. It seems like if I have 2 arrays striped at 64k each, then 128k would be ideal for my ZFS datasets, but again.. my logic isn't infinite when it comes to this fun stuff ;)

The 6120 has 2 volumes, each with a 64k stripe-unit size. I then raidz'ed the 2 volumes together and tried both 64k and 128k recordsize; I do get a bit of a performance gain on rewrite at 128k.

These are dd tests, by the way.

This one is local, and works just great:

bash-3.00# date ; uname -a
Thu Apr 19 21:11:22 EDT 2007
SunOS yuryaku 5.10 Generic_125100-04 sun4u sparc SUNW,Sun-Fire-V210
bash-3.00# df -k
Filesystem            kbytes    used    avail capacity  Mounted on
...
se6120             697761792       26 666303904     1%  /pool/se6120
se6120/rfs-v10      31457280  9710895  21746384    31%  /pool/se6120/rfs-v10
bash-3.00# time dd if=/dev/zero of=/pool/se6120/rfs-v10/rw-test-1.loo bs=8192 count=131072
131072+0 records in
131072+0 records out

real    0m13.783s
real    0m14.136s
user    0m0.331s
sys     0m9.947s

This one is from an HP-UX 11i system mounted to the V210 listed above:

onyx:/rfs># date ; uname -a
Thu Apr 19 21:15:02 EDT 2007
HP-UX onyx B.11.11 U 9000/800 1196424606 unlimited-user license
onyx:/rfs># bdf
Filesystem          kbytes    used   avail %used Mounted on
...
yuryaku.sol:/pool/se6120/rfs-v10
                  31457280 9710896 21746384   31% /rfs/v10
onyx:/rfs># time dd if=/dev/zero of=/rfs/v10/rw-test-2.loo bs=8192 count=131072
131072+0 records in
131072+0 records out

real    1m2.25s
real    0m29.02s
real    0m50.49s
user    0m0.30s
sys     0m8.16s

My 6120 tidbits of interest:

6120 Release 3.2.6 Mon Feb  5 02:26:22 MST 2007 (xxx.xxx.xxx.xxx)
Copyright (C) 1997-2006 Sun Microsystems, Inc. All Rights Reserved.

daikakuji:/:<1>vol mode
volume    mounted   cache        mirror
v1        yes       writebehind  off
v2        yes       writebehind  off

daikakuji:/:<5>vol list
volume    capacity     raid   data        standby
v1        340.851 GB   5      u1d01-06    u1d07
v2        340.851 GB   5      u1d08-13    u1d14

daikakuji:/:<6>sys list
controller      : 2.5
blocksize       : 64k
cache           : auto
mirror          : auto
mp_support      : none
naca            : off
rd_ahead        : off
recon_rate      : med
sys memsize     : 256 MBytes
cache memsize   : 1024 MBytes
fc_topology     : auto
fc_speed        : 2Gb
disk_scrubber   : on
ondg            : befit

Am I missing something? As for the rewrite test, I will tinker some more and paste the results soonish.

Thanks in advance,

Andy Lubel
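One way to see what the backend is actually doing during runs like the ones above (a sketch; the pool and dataset names are taken from the df output, and the 1-second interval is arbitrary) is to watch the pool while the dd is in flight and to confirm the recordsize that new writes will use:

    # in another window while the NFS dd runs
    zpool iostat -v se6120 1

    # confirm the dataset recordsize in effect
    zfs get recordsize se6120/rfs-v10

If the NFS case shows many small writes interleaved with reads that the local case does not, that points at the per-write commit and read-modify-write behavior discussed earlier in the thread.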
Welcome to the club, Andy...

I tried several times to attract the attention of the community to the dramatic performance degradation (about 3 times) of the NFS/ZFS vs. NFS/UFS combination - without any result: [1] http://www.opensolaris.org/jive/thread.jspa?messageID=98592 , [2] http://www.opensolaris.org/jive/thread.jspa?threadID=24015

Just look at the two graphs in my posting dated August 2006 (http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html) to see how bad the situation was; unfortunately, this situation hasn't changed much recently: http://photos1.blogger.com/blogger/7591/428/1600/sfs.1.png

I don't think the storage array is the source of the problems you reported. It's somewhere else...

-- leon
Roch,

isn't there another flag in /etc/system to force ZFS not to send flush requests to NVRAM?

s.

On 4/20/07, Marion Hakanson <hakansom at ohsu.edu> wrote:
> There's a way to tell your 6120 to ignore ZFS cache flushes, until ZFS
> learns to do that itself. See:
>   http://mail.opensolaris.org/pipermail/zfs-discuss/2006-December/024194.html
On Sat, Apr 21, 2007 at 09:05:01AM +0200, Selim Daoud wrote:
> isn't there another flag in /etc/system to force zfs not to send flush
> requests to NVRAM?

I think it's zfs_nocacheflush=1, according to Matthew Ahrens in
http://blogs.digitar.com/jjww/?itemid=44.

--
albert chin (china at thewrittenword.com)
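For reference, that tunable goes in /etc/system just like the zil_disable setting discussed earlier (a sketch; a reboot is required, and as Roch notes below it is only appropriate on arrays with battery-backed NVRAM):

    * tell ZFS not to issue cache-flush requests to the devices
    set zfs:zfs_nocacheflush = 1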
So what you are saying is that if we were using NFS v4, things should be dramatically better?

Do you think this applies to any NFSv4 client, or only Sun's?

Erblichs wrote:
> The benchmark (SFS) page specifies NFSv3,v2 support, so I question
> whether you ran NFSv4. I would expect a major change in performance
> just from NFS version 4 and ZFS. [...]
On Apr 21, 2007, at 9:46 AM, Andy Lubel wrote:
> so what you are saying is that if we were using NFS v4 things
> should be dramatically better?

I certainly don't support this assertion (if it was being made).

NFSv4 does have some advantages from the perspective of enabling more aggressive file data caching; that will enable NFSv4 to outperform NFSv3 in some specific workloads. In general, however, NFSv4 performs similarly to NFSv3.

Spencer
Don't take this as gospel, and someone chime in if I'm off here, but I just saw an ARC case about this issue.... The firmware in the T3 line might already honor the SYNC_NV request. If it doesn't, then we'll have a conf file where you can set the behavior per array. Also, I would think the module or conf file would ship with Sun arrays already listed.

Andy Lubel wrote:
> yeah i saw that post about the other arrays but none for this EOL'd hunk
> of metal. i have some 6130's but hopefully by the time they are implemented
> we will have retired this nfs stuff and stepped into zvol iscsi targets.
Leon Koll,

As a knowledgeable outsider I can say something.

The benchmark (SFS) page specifies NFSv3/v2 support, so I question whether you ran NFSv4. I would expect a major change in performance just from moving to NFS version 4 with ZFS.

The benchmark seems to stress your configuration enough that the latency to service NFS ops increases to the point of non-serviced NFS requests. However, you don't know the byte count per I/O op. Reads are bottlenecked against the rtt of the connection, and writes are normally sub-1K with a later commit. Many ops are probably just file handle verifications, which again are limited by your connection rtt (round trip time). So my initial guess is that the number of NFS threads is somewhat related to the number of non-state (v4 now has state) per-file-handle ops. Thus, if a 64k ZFS block is being modified by 1 byte, COW would require a 64k read, a 1-byte modify, and then allocation and a write of another 64k block. So, for every write op, you COULD be writing a full ZFS block.

This COW philosophy works best with extending delayed writes, etc., where later reads make the trade-off of increased latency for the larger block on a read op versus being able to minimize the number of seeks on the write and read - basically increasing the block size from, say, 8k to 64k. Thus, your read latency goes up just to get the data off the disk while minimizing the number of seeks, and dropping the read-ahead logic for the needed 8k-to-64k file offset.

I do NOT know that "THAT" 4000 IO OPS load would match your maximal load, or that your actual load would never increase past 2000 IO ops. Secondly, jumping from 2000 to 4000 seems to be too big of a jump for your environment; going to 2500 or 3000 might be more appropriate. Lastly, wrt the benchmark, some remnants (NFS and/or ZFS and/or benchmark) seem to remain that have a negative impact.

Finally, my guess is that this NFS load and the benchmark are stressing small partial-block writes, and that is probably one of the worst-case scenarios for ZFS. So my guess is the proper analogy is trying to kill a gnat with a sledgehammer: each write I/O op really needs to be equal to a full-size ZFS block to get the full benefit of ZFS on a per-byte basis.

Mitchell Erblich
Sr Software Engineer
-----------------

Leon Koll wrote:
> I tried several times to attract the attention of the community to the
> dramatic performance degradation (about 3 times) of the NFS/ZFS vs.
> NFS/UFS combination - without any result [...]
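To put rough, illustrative numbers on the partial-block rewrite cost described above (assuming the ZFS default 128K recordsize and the 8K writes used elsewhere in this thread), the worst case per small rewrite is:

    read 128K (old record) + write 128K (new record) = 256K of device I/O
    for 8K of application data, i.e. roughly 32x inflation

In practice the ZIL and transaction-group batching can amortize some of this, but it shows why matching recordsize to the application write size, or doing full-record writes, matters so much for rewrite-heavy NFS workloads.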
Spencer,

Summary: I am not sure that v4 would have a significant advantage over v3 or v2 in all environments. I just believe it can have a significant advantage (with no or minimal drawbacks), and one should use it if at all possible, to verify that it is not the bottleneck.

So, no, I cannot say that NFSv3 has the same performance as v4. At its worst I don't believe v4 performs below v3, and at best it performs up to 2x or more better than v3.

The assumptions are:
- v4 is being actively worked on,
- v3 is stable but no major changes are being done on it,
- leases,
- better data caching (delegations and client callbacks),
- stateful behaviour,
- compound NFS requests (procs) to remove the sequential rtt of individual NFS requests,
- significantly improved lookups for pathing (multi-lookup) and later attr requests - I am sure the attr calls are/were a significant percentage of NFS ops,
- etc.

(I am not telling Spencer anything here that he doesn't already know.)

So, with the compound procs in v4, the increased latencies of some ops might show a different congestion-type behaviour: it scales better under more environments and lets the I/O bandwidth become more of the limiting factor. So yes, my assumption is that NFSv4 has a good possibility of significantly outperforming v3. Either way, I know of no degradation in any op when moving to v4.

So, again, if we are tuning a setup, I would rather see what ZFS does with v4, knowing that a few performance holes were closed or nearly closed versus v3. I don't think this is specific to Sun; it would apply to all NFSv4 environments. (Even when the public NFSv4 paper was written, SFS support was stated as not yet done.)

LASTLY, I would also be interested in the actual timing of the different TCP segments - to see whether acks are constantly in the pipeline between the dst and src, or whether "slow-start restart" behaviour is occurring. It is also possible that with delayed acks at the dst, the number of acks is reduced, which reduces the bandwidth (I/O ops) of subsequent data bursts. Also, is Allman's ABC being used in the TCP implementation?

Mitchell Erblich
----------------

Spencer Shepler wrote:
> NFSv4 does have some advantages from the perspective of enabling
> more aggressive file data caching; that will enable NFSv4 to
> outperform NFSv3 in some specific workloads. In general, however,
> NFSv4 performs similarly to NFSv3.
Albert Chin writes:
> On Sat, Apr 21, 2007 at 09:05:01AM +0200, Selim Daoud wrote:
> > isn't there another flag in /etc/system to force zfs not to send flush
> > requests to NVRAM?
>
> I think it's zfs_nocacheflush=1, according to Matthew Ahrens in
> http://blogs.digitar.com/jjww/?itemid=44.

Correct. So one might use this bit while waiting for the complete SYNC_NV solution. However, setting zfs_nocacheflush=1 opens a small possibility of pool corruption because it bypasses the cache flushes around uberblock updates. So it is definitely not something to use on non-NVRAM storage.

I think it's really best to find out how to disable the flushing at the storage-array level, which is more in line with what the proper SYNC_NV fix does.

-r
Leon Koll writes:
> I don't think the storage array is a source of the problems you reported.
> It's somewhere else...

Why do you say this?

My reading is that almost all NFS/ZFS complaints are either comparing NFS performance against direct attach, comparing UFS vs ZFS on disk with the write cache enabled, or complaining about ZFS running on storage with NVRAM. Your complaint is the one exception: SFS being worse with a ZFS backend than with, say, UFS or VxFS.

My points being: NFS cannot match direct attach for some loads - that is a fact we can't get around. Enabling the write cache is not a valid way to run NFS over UFS. And for ZFS on NVRAM storage, we need to make sure the storage does not flush its cache in response to ZFS requests.

As for SFS over ZFS, it is being investigated by others within Sun. I believe we have stuff in the pipe to make ZFS match or exceed UFS on small server-level loads. So I think your complaint is being heard.

I personally find it incredibly hard to do performance engineering around SFS, so my perspective is that improving the SFS numbers will more likely come from finding ZFS/NFS performance deficiencies on simpler benchmarks.

-r
What I'm saying is that ZFS doesn't play nice with NFS in any of the scenarios I could think of:

- A single second disk in a V210 (Sun 72GB), write cache on and off: roughly 1/3 the performance of UFS when writing files with dd over an NFS mount of the same disk.

- Two RAID-5 volumes of 6 spindles each on a StorEdge 6120 with battery-backed cache, taking ~53 seconds to write 1 GB over an NFS-mounted ZFS stripe, raidz or mirror, with zil_disable'd and write cache off/on. In some testing dd would even seem to 'hang'. When any volslice is formatted UFS and served to the same NFS client, it's ~17 seconds!

We are likely going to try iSCSI instead; the problem behavior doesn't show up there. At some point, though, we would like to use ZFS-based NFS mounts for things.. the current difference in performance just scares us!

-Andy

Roch - PAE wrote:
> My reading is that almost all NFS/ZFS complaints are either complaining
> about NFS performance vs direct attach, comparing UFS vs ZFS on disk with
> write cache enabled, or complaining about ZFS running on storage with
> NVRAM. [...]
On Apr 23, 2007, at 10:56 AM, Andy Lubel wrote:
> What I'm saying is ZFS doesn't play nice with NFS in all the
> scenarios I could think of:
>
> -Single second disk in a v210 (sun72g) write cache on and off =
> ~1/3 the performance of UFS when writing files using dd over an NFS
> mount using the same disk.

If the write cache is enabled under UFS, then it's not a fair comparison, as UFS is open to corruption in that case. ZFS with the write cache enabled vs. UFS with the write cache disabled is a fair comparison. ZFS will enable the write cache by default if it owns the whole disk - something to watch out for when doing successive tests, say doing UFS after ZFS, as the cache will be enabled without you explicitly doing it.

> -2 raid 5 volumes composing of 6 spindles each taking ~53 seconds
> to write 1gb over a NFS mounted zfs stripe, raidz or mirror of a
> storedge 6120 array with bbc, zil_disable'd and write cache off/on.
> In some testing dd would even seem to 'hang'. When any volslice is
> formatted UFS with the same NFS client - its ~17 seconds!

Can you show the output of 'zpool status' for ZFS and the corresponding SVM/UFS setup?

eric
Hello, Roch

<...>
> Then SFS over ZFS is being investigated by others within Sun.
> I believe we have stuff in the pipe to make ZFS match or exceed UFS
> on small server level loads. So I think your complaint is being heard.

You're the first one who has said this, and I am glad I'm being heard.

> I personally find it always incredibly hard to do performance engineering
> around SFS. So my perspective is that improving the SFS numbers will more
> likely come from finding ZFS/NFS performance deficiencies on simpler
> benchmarks.

There is a new version of SPEC SFS in its beta phase (with NFSv4 and CIFS support), available to SPEC members only. I am very interested to see its results on ZFS. Is there anybody from the "others within Sun" who has tested it?

Thanks,

-- leon
I am pretty sure the T3/6120/6320 firmware does not support the SYNCHRONIZE_CACHE commands. Off the top of my head, I do not know whether that triggers any change in behavior on the Solaris side...

The firmware does support the use of the FUA bit, which would potentially lead to similar flushing behavior...

I will try to check in my infinite spare time...

-Joel
I think you have a problem with pool fragmentation. We had the same problem, and changing the recordsize helped. You have to set a smaller recordsize for the pool (all filesystems must use the same recordsize or a smaller one).

First check whether you have problems finding free blocks, with this dtrace script:

#!/usr/sbin/dtrace -s

fbt::space_map_alloc:entry
{
        self->s = arg1;
}

fbt::space_map_alloc:return
/arg1 != -1/
{
        self->s = 0;
}

fbt::space_map_alloc:return
/self->s && (arg1 == -1)/
{
        @s = quantize(self->s);
        self->s = 0;
}

tick-10s
{
        printa(@s);
}
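A quick note on running the script above (the file name is just an example): save it as, say, spacemap.d, make it executable, and leave it running while the NFS write load is applied:

    chmod +x spacemap.d
    ./spacemap.d

Every 10 seconds it prints a histogram of the allocation sizes for which space_map_alloc failed to find space (returned -1). If large sizes (e.g. 128K) keep showing up while smaller ones succeed, that suggests the free space is too fragmented for full-size records, which is when lowering the recordsize as described above tends to help.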
OK... got a break from the 25xx release... Trying to catch up, so sorry for the late response...

The 6120 firmware does not support the cache sync command at all. You could try using a smaller blocksize setting on the array to reduce the number of read/modify/writes you will incur.

It is also important to understand how ZFS attempts to make aligned transactions, since a single 128k write that starts at the beginning of a RAID stripe is guaranteed to do a full-stripe write vs. 2 read/modify/write stripes.

I have considered making an unsupported firmware that turns it into a caching JBOD... I just have not had any "infinite spare time"....

-Joel