Howdy, I have been doing a good deal of NFS debugging in the past month, and created an NFSv3 client operation tracer to simplify my life: $ nfstrace Executable Operation Type Time Size Path mkdir nfs3_lookup physical 359953 N/A /opt/nfs/htdocs/test mkdir nfs3_getattr logical 7577 N/A /opt/nfs/htdocs/test mkdir nfs3_mkdir physical 843500 N/A /opt/nfs/htdocs/test/test rmdir nfs3_access logical 19772 N/A /opt/nfs/htdocs/test rmdir nfs3_lookup logical 69222 N/A /opt/nfs/htdocs/test/test rmdir nfs3_access logical 7744 N/A /opt/nfs/htdocs/test cat nfs3_read logical 54848 8192 /opt/nfs/htdocs/test/1 cat nfs3_read logical 10082 8192 /opt/nfs/htdocs/test/1 bash nfs3_access logical 19707 N/A /opt/nfs/htdocs/test bash nfs3_lookup logical 66602 N/A /opt/nfs/htdocs/test/. bash nfs3_access logical 7740 N/A /opt/nfs/htdocs/test cp nfs3_write physical 38821 3935 /opt/nfs/htdocs/test/99 I created a pretty comprehensive test suite to validate that the script works as advertised, but there are edge cases (e.g., writes and readahead) where the physical vs. logical output may be slightly off. That said, you can snag the script from my website if you think it''s useful: http://daemons.net/~matty Hope everyone is having a nice day, - Ryan -- UNIX Administrator http://daemons.net/~matty
Joe Little
2006-May-04  02:16 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Are there any scripts to track NFS server performance? More specifically, I''m seeing terrible performance from Linux 2.6 clients against Nevada (B38) with a ZFS shared pool, but only with directorys with many small files. The performance for large files seems reasonable (20MB/sec) but small drops down to low k''s/sec on copying over NFS. On 5/3/06, Matty <matty91 at gmail.com> wrote:> > Howdy, > > I have been doing a good deal of NFS debugging in the past month, and > created an NFSv3 client operation tracer to simplify my life: > > $ nfstrace > Executable Operation Type Time Size Path > mkdir nfs3_lookup physical 359953 N/A /opt/nfs/htdocs/test > mkdir nfs3_getattr logical 7577 N/A /opt/nfs/htdocs/test > mkdir nfs3_mkdir physical 843500 N/A /opt/nfs/htdocs/test/test > rmdir nfs3_access logical 19772 N/A /opt/nfs/htdocs/test > rmdir nfs3_lookup logical 69222 N/A /opt/nfs/htdocs/test/test > rmdir nfs3_access logical 7744 N/A /opt/nfs/htdocs/test > cat nfs3_read logical 54848 8192 /opt/nfs/htdocs/test/1 > cat nfs3_read logical 10082 8192 /opt/nfs/htdocs/test/1 > bash nfs3_access logical 19707 N/A /opt/nfs/htdocs/test > bash nfs3_lookup logical 66602 N/A /opt/nfs/htdocs/test/. > bash nfs3_access logical 7740 N/A /opt/nfs/htdocs/test > cp nfs3_write physical 38821 3935 /opt/nfs/htdocs/test/99 > > I created a pretty comprehensive test suite to validate that the script > works as advertised, but there are edge cases (e.g., writes and > readahead) where the physical vs. logical output may be slightly off. > That said, you can snag the script from my website if you think it''s > useful: > > http://daemons.net/~matty > > Hope everyone is having a nice day, > - Ryan > -- > UNIX Administrator > http://daemons.net/~matty > _______________________________________________ > nfs-discuss mailing list > nfs-discuss at opensolaris.org >
Bill Sommerfeld
2006-May-04  12:22 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Wed, 2006-05-03 at 22:16, Joe Little wrote:> Are there any scripts to track NFS server performance? More > specifically, I''m seeing terrible performance from Linux 2.6 clients > against Nevada (B38) with a ZFS shared pool, but only with directorys > with many small files. The performance for large files seems > reasonable (20MB/sec) but small drops down to low k''s/sec on copying > over NFS.Perhaps not your problem but yesterday, Dave Miner and I traced some astoundingly poor NFS-over-ZFS performance on the Netra X1 to an apparent bug in sparc ATA driver code. The interactive response on the server while a tar extract (of the firefox source tarball, so lots of small files) was running on a (solaris) client reminded me of what it was like to use an overloaded timeshared TOPS-20 system the night before problem sets were due... tar was writing about one file per second. mpstat or vmstat (I forget which one I used) showed 100% system time. It took a few tries (one failed because dtrace aborted due to systemic unresponsiveness; the second started right before tar finished and gave us a nice profile of the idle loop..), but eventually we got useful data from: # lockstat -s 10 -I sleep 10 The system was so overloaded that "sleep 10" took 22.7 seconds essentially all cpu time was spent in stack traces resembling: Count indv cuml rcnt nsec CPU+PIL Caller 1004 42% 42% 0.00 3322 cpu[0] drv_usecwait+0x78 nsec ------ Time Distribution ------ count Stack 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 975 ata_wait+0x90 8192 | 22 acersb_get_intr_status+0x34 16384 | 7 ata_set_feature+0x124 ata_disk_start+0x15c ata_hba_start+0xbc The driver is busy-waiting in drv_usecwait() for long periods of time, and my suspicion is that this is most likely an artifact of ZFS''s regular disk write cache flushes. we filed: 6421427 netra x1 slagged by NFS over ZFS leading to long spins in the ATA driver code (if you see unreasonably slow performance, essentially no idle time, and high system time when you think you shouldn''t, the above use of lockstat''s profiling mechanism or dtrace equivalents thereof aggregating on stack() will point out where in the code the kernel is spinning its wheels.) - Bill
Joe Little
2006-May-04  15:16 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
In my case, I''m using an iscsi backend. However, doing any operations locally are fast via ZFS, even with small files. If I setup an rsyncd service, clients can also write small files relatively fast (3-5MB/sec), but over NFS its 5-64K/sec. So, there is a major problem still with NFS/ZFS interaction. On 5/4/06, Bill Sommerfeld <sommerfeld at sun.com> wrote:> On Wed, 2006-05-03 at 22:16, Joe Little wrote: > > Are there any scripts to track NFS server performance? More > > specifically, I''m seeing terrible performance from Linux 2.6 clients > > against Nevada (B38) with a ZFS shared pool, but only with directorys > > with many small files. The performance for large files seems > > reasonable (20MB/sec) but small drops down to low k''s/sec on copying > > over NFS. > > Perhaps not your problem but yesterday, Dave Miner and I traced some > astoundingly poor NFS-over-ZFS performance on the Netra X1 to an > apparent bug in sparc ATA driver code. > > The interactive response on the server while a tar extract (of the > firefox source tarball, so lots of small files) was running on a > (solaris) client reminded me of what it was like to use an overloaded > timeshared TOPS-20 system the night before problem sets were due... > > tar was writing about one file per second. > > mpstat or vmstat (I forget which one I used) showed 100% system time. > > It took a few tries (one failed because dtrace aborted due to systemic > unresponsiveness; the second started right before tar finished and gave > us a nice profile of the idle loop..), but eventually we got useful data > from: > > # lockstat -s 10 -I sleep 10 > > The system was so overloaded that "sleep 10" took 22.7 seconds > > essentially all cpu time was spent in stack traces resembling: > > Count indv cuml rcnt nsec CPU+PIL Caller > 1004 42% 42% 0.00 3322 cpu[0] drv_usecwait+0x78 > > nsec ------ Time Distribution ------ count Stack > 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 975 ata_wait+0x90 > 8192 | 22 acersb_get_intr_status+0x34 > 16384 | 7 ata_set_feature+0x124 > ata_disk_start+0x15c > ata_hba_start+0xbc > > The driver is busy-waiting in drv_usecwait() for long periods of time, > and my suspicion is that this is most likely an artifact of ZFS''s > regular disk write cache flushes. > > we filed: > > 6421427 netra x1 slagged by NFS over ZFS leading to long spins in the > ATA driver code > > (if you see unreasonably slow performance, essentially no idle time, and > high system time when you think you shouldn''t, the above use of > lockstat''s profiling mechanism or dtrace equivalents thereof aggregating > on stack() will point out where in the code the kernel is spinning its > wheels.) > > - Bill > >
Roch Bourbonnais - Performance Engineering
2006-May-04  15:52 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Joe Little writes: > In my case, I''m using an iscsi backend. However, doing any operations > locally are fast via ZFS, even with small files. If I setup an rsyncd > service, clients can also write small files relatively fast > (3-5MB/sec), but over NFS its 5-64K/sec. So, there is a major problem > still with NFS/ZFS interaction. > Does rsync follow an open/write/close pattern ? Because of NFS close-to-open consistency, that pattern delivers at best a throughput of (avg file size) / (server IO latency) Is your avg file size is < 640B and latency 0.01 sec ? This is true no matter what the backend FS. Or have you observed something different with say UFS ? Mounting with the -nocto option (at your own risk) may help if this is the issue. -r > > On 5/4/06, Bill Sommerfeld <sommerfeld at sun.com> wrote: > > On Wed, 2006-05-03 at 22:16, Joe Little wrote: > > > Are there any scripts to track NFS server performance? More > > > specifically, I''m seeing terrible performance from Linux 2.6 clients > > > against Nevada (B38) with a ZFS shared pool, but only with directorys > > > with many small files. The performance for large files seems > > > reasonable (20MB/sec) but small drops down to low k''s/sec on copying > > > over NFS. > > > > Perhaps not your problem but yesterday, Dave Miner and I traced some > > astoundingly poor NFS-over-ZFS performance on the Netra X1 to an > > apparent bug in sparc ATA driver code. > > > > The interactive response on the server while a tar extract (of the > > firefox source tarball, so lots of small files) was running on a > > (solaris) client reminded me of what it was like to use an overloaded > > timeshared TOPS-20 system the night before problem sets were due... > > > > tar was writing about one file per second. > > > > mpstat or vmstat (I forget which one I used) showed 100% system time. > > > > It took a few tries (one failed because dtrace aborted due to systemic > > unresponsiveness; the second started right before tar finished and gave > > us a nice profile of the idle loop..), but eventually we got useful data > > from: > > > > # lockstat -s 10 -I sleep 10 > > > > The system was so overloaded that "sleep 10" took 22.7 seconds > > > > essentially all cpu time was spent in stack traces resembling: > > > > Count indv cuml rcnt nsec CPU+PIL Caller > > 1004 42% 42% 0.00 3322 cpu[0] drv_usecwait+0x78 > > > > nsec ------ Time Distribution ------ count Stack > > 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 975 ata_wait+0x90 > > 8192 | 22 acersb_get_intr_status+0x34 > > 16384 | 7 ata_set_feature+0x124 > > ata_disk_start+0x15c > > ata_hba_start+0xbc > > > > The driver is busy-waiting in drv_usecwait() for long periods of time, > > and my suspicion is that this is most likely an artifact of ZFS''s > > regular disk write cache flushes. > > > > we filed: > > > > 6421427 netra x1 slagged by NFS over ZFS leading to long spins in the > > ATA driver code > > > > (if you see unreasonably slow performance, essentially no idle time, and > > high system time when you think you shouldn''t, the above use of > > lockstat''s profiling mechanism or dtrace equivalents thereof aggregating > > on stack() will point out where in the code the kernel is spinning its > > wheels.) > > > > - Bill > > > > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Joe Little
2006-May-04  17:08 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On 5/4/06, Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at sun.com> wrote:> > Joe Little writes: > > In my case, I''m using an iscsi backend. However, doing any operations > > locally are fast via ZFS, even with small files. If I setup an rsyncd > > service, clients can also write small files relatively fast > > (3-5MB/sec), but over NFS its 5-64K/sec. So, there is a major problem > > still with NFS/ZFS interaction. > > > > Does rsync follow an open/write/close pattern ? > Because of NFS close-to-open consistency, that pattern > delivers at best a throughput of > > (avg file size) / (server IO latency) > > Is your avg file size is < 640B and latency 0.01 sec ? > This is true no matter what the backend FS. Or have you > observed something different with say UFS ?I''ll try the nocto option. Our average file size (I believe) is in the low K''s -- perhaps between 1K-8K. An example where this gets really slow is a prof with untarred versions of every linux kernel release for source code analysis. It processes these directories about 2K/sec. I will confess to not knowing how to measure latency. I haven''t observed the issue with UFS, but I haven''t been trying much since UFS doesn''t interest me as much. That said, our UFS only Solaris 9/10 machines haven''t had this issue.> > Mounting with the -nocto option (at your own risk) may help > if this is the issue. > > > -r > > > > > On 5/4/06, Bill Sommerfeld <sommerfeld at sun.com> wrote: > > > On Wed, 2006-05-03 at 22:16, Joe Little wrote: > > > > Are there any scripts to track NFS server performance? More > > > > specifically, I''m seeing terrible performance from Linux 2.6 clients > > > > against Nevada (B38) with a ZFS shared pool, but only with directorys > > > > with many small files. The performance for large files seems > > > > reasonable (20MB/sec) but small drops down to low k''s/sec on copying > > > > over NFS. > > > > > > Perhaps not your problem but yesterday, Dave Miner and I traced some > > > astoundingly poor NFS-over-ZFS performance on the Netra X1 to an > > > apparent bug in sparc ATA driver code. > > > > > > The interactive response on the server while a tar extract (of the > > > firefox source tarball, so lots of small files) was running on a > > > (solaris) client reminded me of what it was like to use an overloaded > > > timeshared TOPS-20 system the night before problem sets were due... > > > > > > tar was writing about one file per second. > > > > > > mpstat or vmstat (I forget which one I used) showed 100% system time. > > > > > > It took a few tries (one failed because dtrace aborted due to systemic > > > unresponsiveness; the second started right before tar finished and gave > > > us a nice profile of the idle loop..), but eventually we got useful data > > > from: > > > > > > # lockstat -s 10 -I sleep 10 > > > > > > The system was so overloaded that "sleep 10" took 22.7 seconds > > > > > > essentially all cpu time was spent in stack traces resembling: > > > > > > Count indv cuml rcnt nsec CPU+PIL Caller > > > 1004 42% 42% 0.00 3322 cpu[0] drv_usecwait+0x78 > > > > > > nsec ------ Time Distribution ------ count Stack > > > 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 975 ata_wait+0x90 > > > 8192 | 22 acersb_get_intr_status+0x34 > > > 16384 | 7 ata_set_feature+0x124 > > > ata_disk_start+0x15c > > > ata_hba_start+0xbc > > > > > > The driver is busy-waiting in drv_usecwait() for long periods of time, > > > and my suspicion is that this is most likely an artifact of ZFS''s > > > regular disk write cache flushes. > > > > > > we filed: > > > > > > 6421427 netra x1 slagged by NFS over ZFS leading to long spins in the > > > ATA driver code > > > > > > (if you see unreasonably slow performance, essentially no idle time, and > > > high system time when you think you shouldn''t, the above use of > > > lockstat''s profiling mechanism or dtrace equivalents thereof aggregating > > > on stack() will point out where in the code the kernel is spinning its > > > wheels.) > > > > > > - Bill > > > > > > > > _______________________________________________ > > dtrace-discuss mailing list > > dtrace-discuss at opensolaris.org > >
Matty
2006-May-04  17:52 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Thu, 4 May 2006, Joe Little wrote:> I''ll try the nocto option. Our average file size (I believe) is in the > low K''s -- perhaps between 1K-8K. An example where this gets really > slow is a prof with untarred versions of every linux kernel release > for source code analysis. It processes these directories about 2K/sec. > I will confess to not knowing how to measure latency. I haven''t > observed the issue with UFS, but I haven''t been trying much since UFS > doesn''t interest me as much. That said, our UFS only Solaris 9/10 > machines haven''t had this issue.I have found tcptrace, snoop -vv, and the DTrae profile and io providers invaluable for putting together data flow diagrams with latency measurements. There is also some useful information in nfsstat, which might help you better understand activity on the client and server. Hope this helps, - Ryan -- UNIX Administrator http://daemons.net/~matty
Joe Little
2006-May-04  18:16 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Well, nfsstat I can do, and this is the pertinent section. Is this high for getattrs and readdirplus? Version 3: (323847 calls) null getattr setattr lookup access readlink 7 0% 177796 54% 1570 0% 2049 0% 37509 11% 5 0% read write create mkdir symlink mknod 32004 9% 47471 14% 351 0% 221 0% 5 0% 0 0% remove rmdir rename link readdir readdirplus 892 0% 49 0% 0 0% 0 0% 0 0% 22931 7% fsstat fsinfo pathconf commit 619 0% 16 0% 0 0% 352 0% On 5/4/06, Matty <matty91 at gmail.com> wrote:> > On Thu, 4 May 2006, Joe Little wrote: > > > I''ll try the nocto option. Our average file size (I believe) is in the > > low K''s -- perhaps between 1K-8K. An example where this gets really > > slow is a prof with untarred versions of every linux kernel release > > for source code analysis. It processes these directories about 2K/sec. > > I will confess to not knowing how to measure latency. I haven''t > > observed the issue with UFS, but I haven''t been trying much since UFS > > doesn''t interest me as much. That said, our UFS only Solaris 9/10 > > machines haven''t had this issue. > > I have found tcptrace, snoop -vv, and the DTrae profile and io providers > invaluable for putting together data flow diagrams with latency > measurements. There is also some useful information in nfsstat, which > might help you better understand activity on the client and server. > > Hope this helps, > - Ryan > -- > UNIX Administrator > http://daemons.net/~matty > _______________________________________________ > nfs-discuss mailing list > nfs-discuss at opensolaris.org >
Spencer Shepler
2006-May-04  19:34 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Thu, Joe Little wrote:> Well, nfsstat I can do, and this is the pertinent section. Is this > high for getattrs and readdirplus?Not in my opinion based on the workload you have described. The lookup counts are very low but that is replaced by the larger percentage of readdirplus (it is populating the name cache at the client). The getattr counts are large because of the number of files (client is doing a getattr at least for each file for NFS'' close to open behavior). Spencer> Version 3: (323847 calls) > null getattr setattr lookup access readlink > 7 0% 177796 54% 1570 0% 2049 0% 37509 11% 5 0% > read write create mkdir symlink mknod > 32004 9% 47471 14% 351 0% 221 0% 5 0% 0 0% > remove rmdir rename link readdir readdirplus > 892 0% 49 0% 0 0% 0 0% 0 0% 22931 7% > fsstat fsinfo pathconf commit > 619 0% 16 0% 0 0% 352 0% > > > On 5/4/06, Matty <matty91 at gmail.com> wrote: > > > >On Thu, 4 May 2006, Joe Little wrote: > > > >> I''ll try the nocto option. Our average file size (I believe) is in the > >> low K''s -- perhaps between 1K-8K. An example where this gets really > >> slow is a prof with untarred versions of every linux kernel release > >> for source code analysis. It processes these directories about 2K/sec. > >> I will confess to not knowing how to measure latency. I haven''t > >> observed the issue with UFS, but I haven''t been trying much since UFS > >> doesn''t interest me as much. That said, our UFS only Solaris 9/10 > >> machines haven''t had this issue. > > > >I have found tcptrace, snoop -vv, and the DTrae profile and io providers > >invaluable for putting together data flow diagrams with latency > >measurements. There is also some useful information in nfsstat, which > >might help you better understand activity on the client and server. > > > >Hope this helps, > >- Ryan > >-- > >UNIX Administrator > >http://daemons.net/~matty > >_______________________________________________ > >nfs-discuss mailing list > >nfs-discuss at opensolaris.org > > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Jeff A. Smith
2006-May-05  00:51 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Joe Little wrote:> > I will confess to not knowing how to measure latency. I haven''t > observed the issue with UFS, but I haven''t been trying much since UFS > doesn''t interest me as much. That said, our UFS only Solaris 9/10 > machines haven''t had this issue. >Hi Joe, I''ve attached a very basic dtrace script to measure latency for all NFS3 server ops. Notice that this script measures the latency incurred both from NFS3 server processing and the underlying file system. Just run it as root on your server and crank up a NFS3 client. It will produce the following metrics: 1. op count 2. avg response time per op 3. avg system time consumed per op 4. quantized avg resp time 5. quantized avg system time I recommend running the same workload against a NFS mounted tmpfs (to set the utopian baseline), UFS, and ZFS. Then look for the large differences. That will give us an idea of which NFS3 procs are having perf issues, and we can dig deeper. I''m out of town Friday and Monday, but Spencer will cover for me if you run into any problems. Jeff -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: srv3snoop.d URL: <http://mail.opensolaris.org/pipermail/dtrace-discuss/attachments/20060504/02a7e90f/attachment.ksh>
Joe Little
2006-May-05  04:50 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Perhaps I''m finding bugs in the iscsi initiator. I can give the full dtrace reports (should I attach them?) but for now I''ll give the three top segments. I did a mixed directory of only 1000 items of around 3GB as a first test. The NFS->UFS test and the NFS->ZFS (local) were relatively fast, but the NFS->ZFS (over iscsi) was dog slow vs what that ZFS/iscsi link normally does. UFS: NFS3 op counts =============RFS3_SYMLINK 5 RFS3_MKDIR 73 RFS3_COMMIT 885 RFS3_CREATE 901 RFS3_RENAME 901 RFS3_ACCESS 1863 RFS3_SETATTR 3792 RFS3_LOOKUP 3873 RFS3_GETATTR 7568 RFS3_WRITE 46844 NFS3 op avg response time (usec) ===============================RFS3_GETATTR 8 RFS3_ACCESS 9 RFS3_LOOKUP 10 RFS3_RENAME 222 RFS3_SYMLINK 344 RFS3_SETATTR 559 RFS3_CREATE 996 RFS3_MKDIR 4589 RFS3_WRITE 9409 RFS3_COMMIT 13521 NFS3 op avg system time (usec) =============================RFS3_ACCESS 6 RFS3_GETATTR 6 RFS3_LOOKUP 8 RFS3_SETATTR 32 RFS3_CREATE 72 RFS3_RENAME 77 RFS3_SYMLINK 96 RFS3_MKDIR 105 RFS3_COMMIT 112 RFS3_WRITE 125 ZFS (non-iscsi): NFS3 op counts =============RFS3_SYMLINK 5 RFS3_MKDIR 73 RFS3_COMMIT 885 RFS3_CREATE 901 RFS3_RENAME 901 RFS3_ACCESS 1884 RFS3_SETATTR 3792 RFS3_LOOKUP 3945 RFS3_GETATTR 8479 RFS3_WRITE 46844 NFS3 op avg response time (usec) ===============================RFS3_GETATTR 8 RFS3_ACCESS 10 RFS3_LOOKUP 13 RFS3_WRITE 82 RFS3_SETATTR 13785 RFS3_RENAME 14780 RFS3_CREATE 14806 RFS3_SYMLINK 16018 RFS3_MKDIR 16591 RFS3_COMMIT 72752 NFS3 op avg system time (usec) =============================RFS3_GETATTR 7 RFS3_ACCESS 8 RFS3_LOOKUP 9 RFS3_SETATTR 51 RFS3_SYMLINK 53 RFS3_CREATE 63 RFS3_RENAME 67 RFS3_WRITE 80 RFS3_MKDIR 195 RFS3_COMMIT 5069 ZFS (iscsi): NFS3 op counts =============RFS3_FSINFO 1 RFS3_NULL 1 RFS3_FSSTAT 4 RFS3_SYMLINK 5 RFS3_MKDIR 73 RFS3_COMMIT 885 RFS3_CREATE 901 RFS3_RENAME 901 RFS3_ACCESS 1958 RFS3_SETATTR 3792 RFS3_LOOKUP 3963 RFS3_GETATTR 8514 RFS3_WRITE 46844 NFS3 op avg response time (usec) ===============================RFS3_ACCESS 13 RFS3_GETATTR 14 RFS3_FSSTAT 18 RFS3_LOOKUP 24 RFS3_NULL 253 RFS3_WRITE 9931 RFS3_FSINFO 19043 RFS3_SYMLINK 118492 RFS3_SETATTR 138609 RFS3_RENAME 146412 RFS3_CREATE 156186 RFS3_MKDIR 223285 RFS3_COMMIT 273027 NFS3 op avg system time (usec) =============================RFS3_ACCESS 11 RFS3_GETATTR 12 RFS3_LOOKUP 13 RFS3_NULL 14 RFS3_FSSTAT 15 RFS3_SYMLINK 84 RFS3_WRITE 91 RFS3_SETATTR 96 RFS3_CREATE 105 RFS3_RENAME 112 RFS3_FSINFO 118 RFS3_MKDIR 387 RFS3_COMMIT 6256 The ZFS numbers for response times seems worse than UFS in either case, but there is a large difference between ZFS on local SATA disk vs. my iscsi RAIDZ config. On 5/4/06, Jeff A. Smith <Jeff.A.Smith at sun.com> wrote:> Joe Little wrote: > > > > I will confess to not knowing how to measure latency. I haven''t > > observed the issue with UFS, but I haven''t been trying much since UFS > > doesn''t interest me as much. That said, our UFS only Solaris 9/10 > > machines haven''t had this issue. > > > Hi Joe, > > I''ve attached a very basic dtrace script to measure latency for > all NFS3 server ops. Notice that this script measures the latency > incurred both from NFS3 server processing and the underlying file > system. > > Just run it as root on your server and crank up a NFS3 client. It > will produce the following metrics: > > 1. op count > 2. avg response time per op > 3. avg system time consumed per op > 4. quantized avg resp time > 5. quantized avg system time > > I recommend running the same workload against a NFS mounted > tmpfs (to set the utopian baseline), UFS, and ZFS. Then look > for the large differences. That will give us an idea of which > NFS3 procs are having perf issues, and we can dig deeper. > > I''m out of town Friday and Monday, but Spencer will cover for > me if you run into any problems. > > Jeff > > > #!/usr/sbin/dtrace -Fs > > fbt:nfssrv:common_dispatch:entry > /args[0]->rq_vers == 3 && ! self->trace/ > { > self->trace = 1; > this->proc = args[0]->rq_proc; > self->pname = (this->proc >= 0 && this->proc <= 21) ? > stringof (nfssrv`rfscallnames_v3[this->proc]) : > stringof ("invalid proc"); > self->time = timestamp; > self->vtime = vtimestamp; > @opcount[self->pname] = count(); > } > > fbt:nfssrv:common_dispatch:return > /self->trace/ > { > this->usec = (timestamp - self->time) / 1000; > this->systm = (vtimestamp - self->vtime) / 1000; > @avgtime[self->pname] = avg(this->usec); > @qnttime[self->pname] = quantize(this->usec); > > @avgsystm[self->pname] = avg(this->systm); > @qntsystm[self->pname] = quantize(this->systm); > > self->pname = 0; > self->trace = 0; > } > > END > { > printf("\n\nNFS3 op counts\n"); > printf( "==============\n"); > printa("%-20s %@12d\n", @opcount); > > printf("\n\nNFS3 op avg response time (usec)\n"); > printf( "================================\n"); > printa("%-20s %@12d\n", @avgtime); > > printf("\n\nNFS3 op avg system time (usec)\n"); > printf( "==============================\n"); > printa("%-20s %@12d\n", @avgsystm); > > printf("\n\nNFS3 op quantized response time (usec)\n"); > printf( "======================================\n"); > printa("%-20s %@12d\n", @qnttime); > > printf("\n\nNFS3 op quantized system time (usec)\n"); > printf( "====================================\n"); > printa("%-20s %@12d\n", @qntsystm); > } > > >
Joe Little
2006-May-05  04:57 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Adding to the previous test, I also did a local UFS of the same 3G to the iscsi ZFS volume. The dtrace didn''t show anything, but this completed the fastest of all, and so it sheds light that its the NFS->ZFS->iSCSI interaction that is terrible, but ZFS->iSCSI is fastest, with NFS->UFS/ZFS being second fastest. On 5/4/06, Jeff A. Smith <Jeff.A.Smith at sun.com> wrote:> Joe Little wrote: > > > > I will confess to not knowing how to measure latency. I haven''t > > observed the issue with UFS, but I haven''t been trying much since UFS > > doesn''t interest me as much. That said, our UFS only Solaris 9/10 > > machines haven''t had this issue. > > > Hi Joe, > > I''ve attached a very basic dtrace script to measure latency for > all NFS3 server ops. Notice that this script measures the latency > incurred both from NFS3 server processing and the underlying file > system. > > Just run it as root on your server and crank up a NFS3 client. It > will produce the following metrics: > > 1. op count > 2. avg response time per op > 3. avg system time consumed per op > 4. quantized avg resp time > 5. quantized avg system time > > I recommend running the same workload against a NFS mounted > tmpfs (to set the utopian baseline), UFS, and ZFS. Then look > for the large differences. That will give us an idea of which > NFS3 procs are having perf issues, and we can dig deeper. > > I''m out of town Friday and Monday, but Spencer will cover for > me if you run into any problems. > > Jeff > > > #!/usr/sbin/dtrace -Fs > > fbt:nfssrv:common_dispatch:entry > /args[0]->rq_vers == 3 && ! self->trace/ > { > self->trace = 1; > this->proc = args[0]->rq_proc; > self->pname = (this->proc >= 0 && this->proc <= 21) ? > stringof (nfssrv`rfscallnames_v3[this->proc]) : > stringof ("invalid proc"); > self->time = timestamp; > self->vtime = vtimestamp; > @opcount[self->pname] = count(); > } > > fbt:nfssrv:common_dispatch:return > /self->trace/ > { > this->usec = (timestamp - self->time) / 1000; > this->systm = (vtimestamp - self->vtime) / 1000; > @avgtime[self->pname] = avg(this->usec); > @qnttime[self->pname] = quantize(this->usec); > > @avgsystm[self->pname] = avg(this->systm); > @qntsystm[self->pname] = quantize(this->systm); > > self->pname = 0; > self->trace = 0; > } > > END > { > printf("\n\nNFS3 op counts\n"); > printf( "==============\n"); > printa("%-20s %@12d\n", @opcount); > > printf("\n\nNFS3 op avg response time (usec)\n"); > printf( "================================\n"); > printa("%-20s %@12d\n", @avgtime); > > printf("\n\nNFS3 op avg system time (usec)\n"); > printf( "==============================\n"); > printa("%-20s %@12d\n", @avgsystm); > > printf("\n\nNFS3 op quantized response time (usec)\n"); > printf( "======================================\n"); > printa("%-20s %@12d\n", @qnttime); > > printf("\n\nNFS3 op quantized system time (usec)\n"); > printf( "====================================\n"); > printa("%-20s %@12d\n", @qntsystm); > } > > >
Bart Smaalders
2006-May-05  15:07 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Joe Little wrote:> Perhaps I''m finding bugs in the iscsi initiator. > > I can give the full dtrace reports (should I attach them?) but for now > I''ll give the three top segments. I did a mixed directory of only 1000 > items of around 3GB as a first test. The NFS->UFS test and the > NFS->ZFS (local) were relatively fast, but the NFS->ZFS (over iscsi) > was dog slow vs what that ZFS/iscsi link normally does. >Could it be that your NFS implementation is single threaded on synch. operations? In other words, the NFS implementation (or benchmark) is forcing lots of synchronous IO operations that are single threaded. If you''re always waiting for data to end up on the platter before issuing the next IO request, this would kill performance. Perhaps a DTrace script to check workload concurrency is in order... - Bart -- Bart Smaalders Solaris Kernel Performance barts at cyber.eng.sun.com http://blogs.sun.com/barts
Joe Little
2006-May-05  16:01 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Well, I used the dtrace script used here. The NFS implementation (server) is Solaris 11 B38, and the client and the RHEL linux revision, which doesn''t have this problem going through other SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a Linux box as an iscsi initiator, XFS, and Linux''s less than stellar kNFS server revision, and did not see this interaction. Thus, if there are any thread issues, its likely on Solaris'' end or there is particularly bad interaction with linux clients if and only if the solaris backend is iSCSI. That latter doesn''t make sense. On 5/5/06, Bart Smaalders <bart.smaalders at sun.com> wrote:> Joe Little wrote: > > Perhaps I''m finding bugs in the iscsi initiator. > > > > I can give the full dtrace reports (should I attach them?) but for now > > I''ll give the three top segments. I did a mixed directory of only 1000 > > items of around 3GB as a first test. The NFS->UFS test and the > > NFS->ZFS (local) were relatively fast, but the NFS->ZFS (over iscsi) > > was dog slow vs what that ZFS/iscsi link normally does. > > > > Could it be that your NFS implementation is single threaded on synch. > operations? > > In other words, the NFS implementation (or benchmark) is forcing lots > of synchronous IO operations that are single threaded. If you''re always > waiting for data to end up on the platter before issuing the next IO > request, this would kill performance. > > Perhaps a DTrace script to check workload concurrency is in order... > > - Bart > > > > -- > Bart Smaalders Solaris Kernel Performance > barts at cyber.eng.sun.com http://blogs.sun.com/barts >
Joe Little
2006-May-05  16:02 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
..and the client is a RHEL 4 linux revision... (sorry for any confusion) On 5/5/06, Joe Little <jmlittle at gmail.com> wrote:> Well, I used the dtrace script used here. The NFS implementation > (server) is Solaris 11 B38, and the client and the RHEL linux > revision, which doesn''t have this problem going through other > SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > kNFS server revision, and did not see this interaction. Thus, if there > are any thread issues, its likely on Solaris'' end or there is > particularly bad interaction with linux clients if and only if the > solaris backend is iSCSI. That latter doesn''t make sense. > > > On 5/5/06, Bart Smaalders <bart.smaalders at sun.com> wrote: > > Joe Little wrote: > > > Perhaps I''m finding bugs in the iscsi initiator. > > > > > > I can give the full dtrace reports (should I attach them?) but for now > > > I''ll give the three top segments. I did a mixed directory of only 1000 > > > items of around 3GB as a first test. The NFS->UFS test and the > > > NFS->ZFS (local) were relatively fast, but the NFS->ZFS (over iscsi) > > > was dog slow vs what that ZFS/iscsi link normally does. > > > > > > > Could it be that your NFS implementation is single threaded on synch. > > operations? > > > > In other words, the NFS implementation (or benchmark) is forcing lots > > of synchronous IO operations that are single threaded. If you''re always > > waiting for data to end up on the platter before issuing the next IO > > request, this would kill performance. > > > > Perhaps a DTrace script to check workload concurrency is in order... > > > > - Bart > > > > > > > > -- > > Bart Smaalders Solaris Kernel Performance > > barts at cyber.eng.sun.com http://blogs.sun.com/barts > > >
Spencer Shepler
2006-May-05  16:49 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, Joe Little wrote:> Well, I used the dtrace script used here. The NFS implementation > (server) is Solaris 11 B38, and the client and the RHEL linux > revision, which doesn''t have this problem going through other > SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > kNFS server revision, and did not see this interaction. Thus, if there > are any thread issues, its likely on Solaris'' end or there is > particularly bad interaction with linux clients if and only if the > solaris backend is iSCSI. That latter doesn''t make sense.It is a server response time issue as you have demonstrated with data. The server in the NFS/ZFS/iSCSI path is not responding as quickly as other combinations and for this particular application, the overall throughput is subpar. Focusing on the disparity found to understand why the NFS/ZFS/iSCSI combo is not working well seems like the correct path. Spencer
Joe Little
2006-May-05  17:46 UTC
[dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote:> On Fri, Joe Little wrote: > > Well, I used the dtrace script used here. The NFS implementation > > (server) is Solaris 11 B38, and the client and the RHEL linux > > revision, which doesn''t have this problem going through other > > SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > > Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > > kNFS server revision, and did not see this interaction. Thus, if there > > are any thread issues, its likely on Solaris'' end or there is > > particularly bad interaction with linux clients if and only if the > > solaris backend is iSCSI. That latter doesn''t make sense. > > It is a server response time issue as you have demonstrated with data. > The server in the NFS/ZFS/iSCSI path is not responding as quickly > as other combinations and for this particular application, the overall > throughput is subpar. > > Focusing on the disparity found to understand why the NFS/ZFS/iSCSI > combo is not working well seems like the correct path. >That''s where I''m at a loss. Has the NFS/ZFS/iSCSI path been tested by Sun at all?> Spencer > >
Spencer Shepler
2006-May-05  19:32 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, Joe Little wrote:> On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote: > >On Fri, Joe Little wrote: > >> Well, I used the dtrace script used here. The NFS implementation > >> (server) is Solaris 11 B38, and the client and the RHEL linux > >> revision, which doesn''t have this problem going through other > >> SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > >> Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > >> kNFS server revision, and did not see this interaction. Thus, if there > >> are any thread issues, its likely on Solaris'' end or there is > >> particularly bad interaction with linux clients if and only if the > >> solaris backend is iSCSI. That latter doesn''t make sense. > > > >It is a server response time issue as you have demonstrated with data. > >The server in the NFS/ZFS/iSCSI path is not responding as quickly > >as other combinations and for this particular application, the overall > >throughput is subpar. > > > >Focusing on the disparity found to understand why the NFS/ZFS/iSCSI > >combo is not working well seems like the correct path. > > > > That''s where I''m at a loss. Has the NFS/ZFS/iSCSI path been tested by > Sun at all?I don''t know and this seems like a good point to move the discussion to zfs-discuss and nfs-discuss to see if there is additional input. Spencer
Joe Little
2006-May-05  19:55 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
well, it was already an NFS-discuss list message. Someone else added dtrace-discuss to it. I have already noted this to a degree on zfs-discuss, but it seems to be mainly a NFS specific issue at this stage. On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote:> > On Fri, Joe Little wrote: > > On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote: > > >On Fri, Joe Little wrote: > > >> Well, I used the dtrace script used here. The NFS implementation > > >> (server) is Solaris 11 B38, and the client and the RHEL linux > > >> revision, which doesn''t have this problem going through other > > >> SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > > >> Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > > >> kNFS server revision, and did not see this interaction. Thus, if > there > > >> are any thread issues, its likely on Solaris'' end or there is > > >> particularly bad interaction with linux clients if and only if the > > >> solaris backend is iSCSI. That latter doesn''t make sense. > > > > > >It is a server response time issue as you have demonstrated with data. > > >The server in the NFS/ZFS/iSCSI path is not responding as quickly > > >as other combinations and for this particular application, the overall > > >throughput is subpar. > > > > > >Focusing on the disparity found to understand why the NFS/ZFS/iSCSI > > >combo is not working well seems like the correct path. > > > > > > > That''s where I''m at a loss. Has the NFS/ZFS/iSCSI path been tested by > > Sun at all? > > I don''t know and this seems like a good point to move the > discussion to zfs-discuss and nfs-discuss to see if there > is additional input. > > Spencer >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060505/b061d5d8/attachment.html>
Joe Little
2006-May-05  21:15 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
I just did another test, this time using a linux NFS client against B38 with UFS and iscsi disks. It was close to the same speed (over 8MB/sec average) as going to UFS on local disk or ZFS on local disk (around 20MB/sec). My UFS formated iscsi disk was only a single iscsi disk and not like the RAIDZ array I have for ZFS on iscsi. Either way, it looks like all methods of writing (local or NFS) to UFS or ZFS are plenty fast. It took 1minute 7 seconds to write to over NFS to UFS, and 2minutes 15 seconds to write over NFS to UFS on iscsi. With ZFS over iscsi, the same 3G directory sample took 50+ minutes, at low k''s/sec. Here''s the NFS->UFS->iSCSI dtrace latency times for comparison: dtrace: script ''./nfs.dtrace'' matched 3 probes ^C CPU FUNCTION 0 | :END NFS3 op counts =============RFS3_FSSTAT 4 RFS3_SYMLINK 5 RFS3_MKDIR 73 RFS3_COMMIT 886 RFS3_CREATE 901 RFS3_RENAME 901 RFS3_ACCESS 1881 RFS3_SETATTR 3792 RFS3_LOOKUP 3882 RFS3_GETATTR 7579 RFS3_WRITE 46844 NFS3 op avg response time (usec) ===============================RFS3_GETATTR 7 RFS3_ACCESS 8 RFS3_LOOKUP 14 RFS3_FSSTAT 135 RFS3_RENAME 884 RFS3_SETATTR 2172 RFS3_CREATE 6410 RFS3_MKDIR 15766 RFS3_SYMLINK 22340 RFS3_WRITE 30577 RFS3_COMMIT 57034 NFS3 op avg system time (usec) =============================RFS3_ACCESS 6 RFS3_GETATTR 6 RFS3_LOOKUP 8 RFS3_FSSTAT 23 RFS3_SETATTR 27 RFS3_CREATE 66 RFS3_RENAME 75 RFS3_SYMLINK 90 RFS3_MKDIR 120 RFS3_WRITE 188 RFS3_COMMIT 306 On 5/5/06, Joe Little <jmlittle at gmail.com> wrote:> > well, it was already an NFS-discuss list message. Someone else added > dtrace-discuss to it. I have already noted this to a degree on zfs-discuss, > but it seems to be mainly a NFS specific issue at this stage. > > > > On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote: > > > > On Fri, Joe Little wrote: > > > On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote: > > > >On Fri, Joe Little wrote: > > > >> Well, I used the dtrace script used here. The NFS implementation > > > >> (server) is Solaris 11 B38, and the client and the RHEL linux > > > >> revision, which doesn''t have this problem going through other > > > >> SAN-based NAS (NetApp, EMC, etc.. even iSCSI). I previously setup a > > > > > >> Linux box as an iscsi initiator, XFS, and Linux''s less than stellar > > > >> kNFS server revision, and did not see this interaction. Thus, if > > there > > > >> are any thread issues, its likely on Solaris'' end or there is > > > >> particularly bad interaction with linux clients if and only if the > > > >> solaris backend is iSCSI. That latter doesn''t make sense. > > > > > > > >It is a server response time issue as you have demonstrated with > > data. > > > >The server in the NFS/ZFS/iSCSI path is not responding as quickly > > > >as other combinations and for this particular application, the > > overall > > > >throughput is subpar. > > > > > > > >Focusing on the disparity found to understand why the NFS/ZFS/iSCSI > > > >combo is not working well seems like the correct path. > > > > > > > > > > That''s where I''m at a loss. Has the NFS/ZFS/iSCSI path been tested by > > > Sun at all? > > > > I don''t know and this seems like a good point to move the > > discussion to zfs-discuss and nfs-discuss to see if there > > is additional input. > > > > Spencer > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060505/9dc85483/attachment.html>
Eric Schrock
2006-May-05  21:32 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
My gut feeling is that somehow the DKIOCFLUSHWRITECACHE ioctls (which translate to the SCSI flush write cace requests) are throwing iSCSI for a loop. We''ve exposed a number of bugs in our drivers because ZFS is the first filesystem to actually care to issue this request. To turn this off, you can try: # mdb -kw> ::walk spa | ::print spa_t spa_root_vdev | ::vdev -rADDR STATE AUX DESCRIPTION ffffffff82dc16c0 HEALTHY - root ffffffff82dc0640 HEALTHY - /dev/dsk/c0d0s0> ffffffff82dc0640::print -a vdev_t vdev_nowritecacheffffffff82dc0af8 vdev_nowritecache = 0 (B_FALSE)> ffffffff82dc0af8/W10xffffffff82dc0af8: 0 = 0x1>See if that makes a difference. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Spencer Shepler
2006-May-05  21:35 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, Joe Little wrote:> well, it was already an NFS-discuss list message. Someone else added > dtrace-discuss to it. I have already noted this to a degree on zfs-discuss, > but it seems to be mainly a NFS specific issue at this stage.So I took your original data you posted and reformatted it to make it a little easier to compare. The first table, the overall operation counts, shows that the client is generally sending the same number of requests. Ignoring the FSINFO, NULL, FSSTAT numbers now given that they are a data collection anamoly. The numbers are pretty much exact except for GETATTR when the combination if NFS/ZFS. That is a little odd and may be something to look at out of curiousity but given the average response times in the second table for GETATTR, it shouldn''t be an issue. So the client is sending the same number of ops. Good to confirm that. NFS3 op counts UFS ZFS (non-iscsi) ZFS (iscsi) ==================================================================RFS3_FSINFO - - 1 RFS3_NULL - - 1 RFS3_FSSTAT - - 4 RFS3_SYMLINK 5 5 5 RFS3_MKDIR 73 73 73 RFS3_COMMIT 885 885 885 RFS3_CREATE 901 901 901 RFS3_RENAME 901 901 901 RFS3_ACCESS 1863 1884 1958 RFS3_SETATTR 3792 3792 3792 RFS3_LOOKUP 3873 3945 3963 RFS3_GETATTR 7568 8479 8514 RFS3_WRITE 46844 46844 46844 This second table is the most interesting one. It is easy to see that when comparing the ZFS non-iscsi and iscsi columns that the metadata operations: CREATE, MKDIR, RENAME, SETATTR, SYMLINK are on average an order of magnitude great in response time. More at the end on why that is the source of the "throughput" issues you are observing. NFS3 avg rsp time (usec) UFS ZFS (non-iscsi) ZFS (iscsi) ==================================================================RFS3_ACCESS 9 10 13 RFS3_COMMIT 13521 72752 273027 RFS3_CREATE 996 14806 156186 RFS3_FSINFO - - 19043 RFS3_FSSTAT - - 18 RFS3_GETATTR 8 8 14 RFS3_LOOKUP 10 13 24 RFS3_MKDIR 4589 16591 223285 RFS3_NULL - - 253 RFS3_RENAME 222 14780 146412 RFS3_SETATTR 559 13785 138609 RFS3_SYMLINK 344 16018 118492 RFS3_WRITE 9409 82 9931 And the system time measurements aren''t too interesting in that the extra overhead is unlikely to be the biggest issue at this point. NFS3 op avg sys time (usec) UFS ZFS (non-iscsi) ZFS (iscsi) ==================================================================RFS3_ACCESS 6 8 11 RFS3_COMMIT 112 5069 6256 RFS3_CREATE 72 63 105 RFS3_FSINFO - - 118 RFS3_FSSTAT - - 15 RFS3_GETATTR 6 7 12 RFS3_LOOKUP 8 9 13 RFS3_MKDIR 105 195 387 RFS3_NULL - - 14 RFS3_RENAME 77 67 112 RFS3_SETATTR 32 51 96 RFS3_SYMLINK 96 53 84 RFS3_WRITE 125 80 91 ------ So the average response time table is the key. With the workload you described (creating directories and files of "smaller" size) the response time of the operations like CREATE, MKDIR, RENAME, SETATTR and SYMLINK will become key. The NFS protocol requires that for these operations (and COMMIT for that matter), the associated file system metadata must be on stable storage before the server can respond that the operation is complete. Therefore, the order of magnitude greater response times for those operations will slow the overall throughput to what you are observing. With the data that you provided for the UFS/iscsi config there is additional response time being added because of the iscsi backend and that would be expected. However, I don''t have a good explanation for why the ZFS/iscsi combination is so much more extreme than the UFS/iscsi combination. I do know why local access to the ZFS/iscsi config is faster and that is because for local access the requirement for meta-data update is not present as it is for the NFS server so there is a lot of disk updates that can be effectively aggregated and postponed. Spencer
Joe Little
2006-May-05  22:46 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Thanks for the tip. In the local case, I could send to the iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time of 50seconds (17 seconds better than UFS). However, I didn''t even both finishing the NFS client test, since it was taking a few seconds between multiple 27K files. So, it didn''t help NFS at all. I''m wondering if there is something on the NFS end that needs changing, no? Also, how would one easily script the mdb command below to make permanent? On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote:> My gut feeling is that somehow the DKIOCFLUSHWRITECACHE ioctls (which > translate to the SCSI flush write cace requests) are throwing iSCSI for > a loop. We''ve exposed a number of bugs in our drivers because ZFS is > the first filesystem to actually care to issue this request. > > To turn this off, you can try: > > # mdb -kw > > ::walk spa | ::print spa_t spa_root_vdev | ::vdev -r > ADDR STATE AUX DESCRIPTION > ffffffff82dc16c0 HEALTHY - root > ffffffff82dc0640 HEALTHY - /dev/dsk/c0d0s0 > > ffffffff82dc0640::print -a vdev_t vdev_nowritecache > ffffffff82dc0af8 vdev_nowritecache = 0 (B_FALSE) > > ffffffff82dc0af8/W1 > 0xffffffff82dc0af8: 0 = 0x1 > > > > See if that makes a difference. > > - Eric > > -- > Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock >
Eric Schrock
2006-May-05  22:53 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote:> Thanks for the tip. In the local case, I could send to the > iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time > of 50seconds (17 seconds better than UFS). However, I didn''t even both > finishing the NFS client test, since it was taking a few seconds > between multiple 27K files. So, it didn''t help NFS at all. I''m > wondering if there is something on the NFS end that needs changing, > no?Keep in mind that turning off this flag may corrupt on-disk state in the event of power loss, etc. What was the delta in the local case? 17 seconds better than UFS, but percentage wise how much faster than the original? NFS has the property that it does an enormous amount of synchronous activity, which can tickle interesting pathologies. But it''s strange that it didn''t help NFS that much.> Also, how would one easily script the mdb command below to make > permanent?Unfortunately, there''s no good way to do this, since the ''::vdev'' command isn''t pipeable, and you''d have to use ''::map'' with hardcoded offsets that would be extremely brittle in the face of upgrades. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Joe Little
2006-May-05  23:10 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote:> On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: > > Thanks for the tip. In the local case, I could send to the > > iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time > > of 50seconds (17 seconds better than UFS). However, I didn''t even both > > finishing the NFS client test, since it was taking a few seconds > > between multiple 27K files. So, it didn''t help NFS at all. I''m > > wondering if there is something on the NFS end that needs changing, > > no? > > Keep in mind that turning off this flag may corrupt on-disk state in the > event of power loss, etc. What was the delta in the local case? 17 > seconds better than UFS, but percentage wise how much faster than the > original? >I believe it was only about 5-10% faster. I don''t have the time results off hand, just some dtrace latency reports.> NFS has the property that it does an enormous amount of synchronous > activity, which can tickle interesting pathologies. But it''s strange > that it didn''t help NFS that much.Should I also mount via async.. would this be honored on the Solaris end? The other option mentioned with similar caveats was nocto. I just tried with both, and the observed transfer rate was about 1.4k/s. It was painful deleting the 3G directory via NFS, with about 100k/s deletion rate on these 1000 files. Of course, When I went locally the delete was instantaneous. I''ll give the complete results of a minute of this copy in dtrace results: NFS3 op counts =============RFS3_SYMLINK 5 RFS3_MKDIR 27 RFS3_COMMIT 45 RFS3_RENAME 46 RFS3_CREATE 47 RFS3_ACCESS 126 RFS3_WRITE 143 RFS3_SETATTR 229 RFS3_GETATTR 536 RFS3_LOOKUP 1217 NFS3 op avg response time (usec) ===============================RFS3_LOOKUP 12 RFS3_ACCESS 13 RFS3_GETATTR 15 RFS3_WRITE 42 RFS3_RENAME 190052 RFS3_CREATE 199628 RFS3_COMMIT 244217 RFS3_SETATTR 256737 RFS3_MKDIR 258186 RFS3_SYMLINK 1150959 NFS3 op avg system time (usec) =============================RFS3_LOOKUP 10 RFS3_ACCESS 11 RFS3_GETATTR 12 RFS3_WRITE 40 RFS3_SETATTR 76 RFS3_COMMIT 100 RFS3_RENAME 101 RFS3_CREATE 104 RFS3_SYMLINK 105 RFS3_MKDIR 107 NFS3 op quantized response time (usec) =====================================RFS3_ACCESS value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 106 16 |@@@@@@ 20 32 | 0 RFS3_WRITE value ------------- Distribution ------------- count 8 | 0 16 |@@@@@@@@@ 33 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 95 64 |@@@@ 15 128 | 0 RFS3_GETATTR value ------------- Distribution ------------- count 2 | 0 4 | 1 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 347 16 |@@@@@@@@@@@@@ 176 32 |@ 11 64 | 0 128 | 1 256 | 0 RFS3_LOOKUP value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1060 16 |@@@@ 133 32 |@ 24 64 | 0 RFS3_SYMLINK value ------------- Distribution ------------- count 32768 | 0 65536 |@@@@@@@@ 1 131072 |@@@@@@@@ 1 262144 | 0 524288 | 0 1048576 |@@@@@@@@ 1 2097152 |@@@@@@@@@@@@@@@@ 2 4194304 | 0 RFS3_MKDIR value ------------- Distribution ------------- count 16384 | 0 32768 |@@@@ 3 65536 |@@@@@@@@@@@@@@@ 10 131072 |@@@@@@@@@@@@@@@ 10 262144 |@ 1 524288 | 0 1048576 |@@@@ 3 2097152 | 0 RFS3_CREATE value ------------- Distribution ------------- count 256 | 0 512 |@@ 2 1024 | 0 2048 |@ 1 4096 | 0 8192 | 0 16384 |@ 1 32768 |@@@@@@@@@@@@@@@@@@@@ 23 65536 |@@@@@@@@@@@@ 14 131072 |@ 1 262144 | 0 524288 |@ 1 1048576 |@@@ 4 2097152 | 0 RFS3_RENAME value ------------- Distribution ------------- count 256 | 0 512 |@ 1 1024 |@@ 2 2048 | 0 4096 | 0 8192 | 0 16384 | 0 32768 |@@@@@@@@@@@@@@@@@ 20 65536 |@@@@@@@@@@ 12 131072 |@@@@@ 6 262144 | 0 524288 |@ 1 1048576 |@@@ 4 2097152 | 0 RFS3_COMMIT value ------------- Distribution ------------- count 256 | 0 512 |@ 1 1024 |@ 1 2048 | 0 4096 | 0 8192 | 0 16384 |@ 1 32768 |@@@@@@@@@@@@@@@@@ 19 65536 |@@@@@@@@@@@ 12 131072 |@@@@ 4 262144 | 0 524288 | 0 1048576 |@@@@@@ 7 2097152 | 0 RFS3_SETATTR value ------------- Distribution ------------- count 256 | 0 512 |@ 4 1024 |@ 6 2048 | 0 4096 | 0 8192 | 0 16384 |@ 4 32768 |@@@@@@@@@@@@@@@@ 89 65536 |@@@@@@@@@ 53 131072 |@@@@@@ 34 262144 |@ 4 524288 |@ 3 1048576 |@@@@@ 30 2097152 | 1 4194304 | 0 NFS3 op quantized system time (usec) ===================================RFS3_SYMLINK value ------------- Distribution ------------- count 32 | 0 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5 128 | 0 RFS3_ACCESS value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 112 16 |@@@@ 14 32 | 0 RFS3_MKDIR value ------------- Distribution ------------- count 32 | 0 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 24 128 |@@@@ 3 256 | 0 RFS3_RENAME value ------------- Distribution ------------- count 32 | 0 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41 128 |@@@@ 5 256 | 0 RFS3_COMMIT value ------------- Distribution ------------- count 16 | 0 32 |@@@@@@@@@@@@@@@@ 18 64 |@@@@@@@@@ 10 128 |@@@@@@@@@@@@@@@ 17 256 | 0 RFS3_CREATE value ------------- Distribution ------------- count 32 | 0 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41 128 |@@@@ 5 256 |@ 1 512 | 0 RFS3_WRITE value ------------- Distribution ------------- count 8 | 0 16 |@@@@@@@@@@@@@@ 50 32 |@@@@@@@@@@@@@@@@@@@@@@@ 84 64 |@@@ 9 128 | 0 RFS3_GETATTR value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@ 126 8 |@@@@@@@@@@@@@@@@@@@ 253 16 |@@@@@@@@@@@ 148 32 |@ 9 64 | 0 RFS3_LOOKUP value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1104 16 |@@@ 95 32 |@ 18 64 | 0 RFS3_SETATTR value ------------- Distribution ------------- count 16 | 0 32 |@@@@@@@@@@@@@@ 81 64 |@@@@@@@@@@@@@@@@@@@@@@@@@ 143 128 |@ 4 256 | 0> > > Also, how would one easily script the mdb command below to make > > permanent? > > Unfortunately, there''s no good way to do this, since the ''::vdev'' > command isn''t pipeable, and you''d have to use ''::map'' with hardcoded > offsets that would be extremely brittle in the face of upgrades. > > - Eric > > -- > Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock >
Joe Little
2006-May-05  23:14 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
And of course, just to circle back, an rsync via ssh from the client to the Solaris ZFS/iscsi server came in at 17.5MB/sec, taking 1minute 16 seconds, or about 20% longer. So, NFS (over TCP) is 1.4k/s, and encrypted ssh is 17.5MB/sec following the same network path. On 5/5/06, Joe Little <jmlittle at gmail.com> wrote:> On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote: > > On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: > > > Thanks for the tip. In the local case, I could send to the > > > iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time > > > of 50seconds (17 seconds better than UFS). However, I didn''t even both > > > finishing the NFS client test, since it was taking a few seconds > > > between multiple 27K files. So, it didn''t help NFS at all. I''m > > > wondering if there is something on the NFS end that needs changing, > > > no? > > > > Keep in mind that turning off this flag may corrupt on-disk state in the > > event of power loss, etc. What was the delta in the local case? 17 > > seconds better than UFS, but percentage wise how much faster than the > > original? > > > > I believe it was only about 5-10% faster. I don''t have the time > results off hand, just some dtrace latency reports. > > > NFS has the property that it does an enormous amount of synchronous > > activity, which can tickle interesting pathologies. But it''s strange > > that it didn''t help NFS that much. > > Should I also mount via async.. would this be honored on the Solaris > end? The other option mentioned with similar caveats was nocto. I just > tried with both, and the observed transfer rate was about 1.4k/s. It > was painful deleting the 3G directory via NFS, with about 100k/s > deletion rate on these 1000 files. Of course, When I went locally the > delete was instantaneous. > > I''ll give the complete results of a minute of this copy in dtrace results: > > NFS3 op counts > =============> RFS3_SYMLINK 5 > RFS3_MKDIR 27 > RFS3_COMMIT 45 > RFS3_RENAME 46 > RFS3_CREATE 47 > RFS3_ACCESS 126 > RFS3_WRITE 143 > RFS3_SETATTR 229 > RFS3_GETATTR 536 > RFS3_LOOKUP 1217 > > > NFS3 op avg response time (usec) > ===============================> RFS3_LOOKUP 12 > RFS3_ACCESS 13 > RFS3_GETATTR 15 > RFS3_WRITE 42 > RFS3_RENAME 190052 > RFS3_CREATE 199628 > RFS3_COMMIT 244217 > RFS3_SETATTR 256737 > RFS3_MKDIR 258186 > RFS3_SYMLINK 1150959 > > > NFS3 op avg system time (usec) > =============================> RFS3_LOOKUP 10 > RFS3_ACCESS 11 > RFS3_GETATTR 12 > RFS3_WRITE 40 > RFS3_SETATTR 76 > RFS3_COMMIT 100 > RFS3_RENAME 101 > RFS3_CREATE 104 > RFS3_SYMLINK 105 > RFS3_MKDIR 107 > > > NFS3 op quantized response time (usec) > =====================================> RFS3_ACCESS > value ------------- Distribution ------------- count > 4 | 0 > 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 106 > 16 |@@@@@@ 20 > 32 | 0 > > RFS3_WRITE > value ------------- Distribution ------------- count > 8 | 0 > 16 |@@@@@@@@@ 33 > 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 95 > 64 |@@@@ 15 > 128 | 0 > > RFS3_GETATTR > value ------------- Distribution ------------- count > 2 | 0 > 4 | 1 > 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 347 > 16 |@@@@@@@@@@@@@ 176 > 32 |@ 11 > 64 | 0 > 128 | 1 > 256 | 0 > > RFS3_LOOKUP > value ------------- Distribution ------------- count > 4 | 0 > 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1060 > 16 |@@@@ 133 > 32 |@ 24 > 64 | 0 > > RFS3_SYMLINK > value ------------- Distribution ------------- count > 32768 | 0 > 65536 |@@@@@@@@ 1 > 131072 |@@@@@@@@ 1 > 262144 | 0 > 524288 | 0 > 1048576 |@@@@@@@@ 1 > 2097152 |@@@@@@@@@@@@@@@@ 2 > 4194304 | 0 > > RFS3_MKDIR > value ------------- Distribution ------------- count > 16384 | 0 > 32768 |@@@@ 3 > 65536 |@@@@@@@@@@@@@@@ 10 > 131072 |@@@@@@@@@@@@@@@ 10 > 262144 |@ 1 > 524288 | 0 > 1048576 |@@@@ 3 > 2097152 | 0 > > RFS3_CREATE > value ------------- Distribution ------------- count > 256 | 0 > 512 |@@ 2 > 1024 | 0 > 2048 |@ 1 > 4096 | 0 > 8192 | 0 > 16384 |@ 1 > 32768 |@@@@@@@@@@@@@@@@@@@@ 23 > 65536 |@@@@@@@@@@@@ 14 > 131072 |@ 1 > 262144 | 0 > 524288 |@ 1 > 1048576 |@@@ 4 > 2097152 | 0 > > RFS3_RENAME > value ------------- Distribution ------------- count > 256 | 0 > 512 |@ 1 > 1024 |@@ 2 > 2048 | 0 > 4096 | 0 > 8192 | 0 > 16384 | 0 > 32768 |@@@@@@@@@@@@@@@@@ 20 > 65536 |@@@@@@@@@@ 12 > 131072 |@@@@@ 6 > 262144 | 0 > 524288 |@ 1 > 1048576 |@@@ 4 > 2097152 | 0 > > RFS3_COMMIT > value ------------- Distribution ------------- count > 256 | 0 > 512 |@ 1 > 1024 |@ 1 > 2048 | 0 > 4096 | 0 > 8192 | 0 > 16384 |@ 1 > 32768 |@@@@@@@@@@@@@@@@@ 19 > 65536 |@@@@@@@@@@@ 12 > 131072 |@@@@ 4 > 262144 | 0 > 524288 | 0 > 1048576 |@@@@@@ 7 > 2097152 | 0 > > RFS3_SETATTR > value ------------- Distribution ------------- count > 256 | 0 > 512 |@ 4 > 1024 |@ 6 > 2048 | 0 > 4096 | 0 > 8192 | 0 > 16384 |@ 4 > 32768 |@@@@@@@@@@@@@@@@ 89 > 65536 |@@@@@@@@@ 53 > 131072 |@@@@@@ 34 > 262144 |@ 4 > 524288 |@ 3 > 1048576 |@@@@@ 30 > 2097152 | 1 > 4194304 | 0 > > > > NFS3 op quantized system time (usec) > ===================================> RFS3_SYMLINK > value ------------- Distribution ------------- count > 32 | 0 > 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5 > 128 | 0 > > RFS3_ACCESS > value ------------- Distribution ------------- count > 4 | 0 > 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 112 > 16 |@@@@ 14 > 32 | 0 > > RFS3_MKDIR > value ------------- Distribution ------------- count > 32 | 0 > 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 24 > 128 |@@@@ 3 > 256 | 0 > > RFS3_RENAME > value ------------- Distribution ------------- count > 32 | 0 > 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41 > 128 |@@@@ 5 > 256 | 0 > > RFS3_COMMIT > value ------------- Distribution ------------- count > 16 | 0 > 32 |@@@@@@@@@@@@@@@@ 18 > 64 |@@@@@@@@@ 10 > 128 |@@@@@@@@@@@@@@@ 17 > 256 | 0 > > RFS3_CREATE > value ------------- Distribution ------------- count > 32 | 0 > 64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41 > 128 |@@@@ 5 > 256 |@ 1 > 512 | 0 > > RFS3_WRITE > value ------------- Distribution ------------- count > 8 | 0 > 16 |@@@@@@@@@@@@@@ 50 > 32 |@@@@@@@@@@@@@@@@@@@@@@@ 84 > 64 |@@@ 9 > 128 | 0 > > RFS3_GETATTR > value ------------- Distribution ------------- count > 2 | 0 > 4 |@@@@@@@@@ 126 > 8 |@@@@@@@@@@@@@@@@@@@ 253 > 16 |@@@@@@@@@@@ 148 > 32 |@ 9 > 64 | 0 > > RFS3_LOOKUP > value ------------- Distribution ------------- count > 4 | 0 > 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1104 > 16 |@@@ 95 > 32 |@ 18 > 64 | 0 > > RFS3_SETATTR > value ------------- Distribution ------------- count > 16 | 0 > 32 |@@@@@@@@@@@@@@ 81 > 64 |@@@@@@@@@@@@@@@@@@@@@@@@@ 143 > 128 |@ 4 > 256 | 0 > > > > > > > > > Also, how would one easily script the mdb command below to make > > > permanent? > > > > Unfortunately, there''s no good way to do this, since the ''::vdev'' > > command isn''t pipeable, and you''d have to use ''::map'' with hardcoded > > offsets that would be extremely brittle in the face of upgrades. > > > > - Eric > > > > -- > > Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock > > >
Spencer Shepler
2006-May-05  23:23 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, Joe Little wrote:> On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote: > >On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: > >> Thanks for the tip. In the local case, I could send to the > >> iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time > >> of 50seconds (17 seconds better than UFS). However, I didn''t even both > >> finishing the NFS client test, since it was taking a few seconds > >> between multiple 27K files. So, it didn''t help NFS at all. I''m > >> wondering if there is something on the NFS end that needs changing, > >> no? > > > >Keep in mind that turning off this flag may corrupt on-disk state in the > >event of power loss, etc. What was the delta in the local case? 17 > >seconds better than UFS, but percentage wise how much faster than the > >original? > > > > I believe it was only about 5-10% faster. I don''t have the time > results off hand, just some dtrace latency reports. > > >NFS has the property that it does an enormous amount of synchronous > >activity, which can tickle interesting pathologies. But it''s strange > >that it didn''t help NFS that much. > > Should I also mount via async.. would this be honored on the Solaris > end? The other option mentioned with similar caveats was nocto. I just > tried with both, and the observed transfer rate was about 1.4k/s. It > was painful deleting the 3G directory via NFS, with about 100k/s > deletion rate on these 1000 files. Of course, When I went locally the > delete was instantaneous.I wouldn''t change any of the options at the client. The issue is at the server side and none of the other combinations that you originally pointed out have this problem, right? Mount options at the client will just muddy the waters. We need to understand if/what the NFS/ZFS/iscsi interaction is and why it is so much worse. As Eric mentioned, there may be some interesting pathologies at play here and we need to understand what they are so they can be addressed. My suggestion is additional dtrace data collection but I don''t have a specific suggestion as to how/what to track next. Because of the significant additional latency, I would be looking for big increases in the number of I/Os being generated to the iscsi backend as compared to the local attached case. I would also look for some type of serialization of I/Os that is occurring with iscsi vs. the local attach. Spencer
Joe Little
2006-May-06  00:59 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Are there known i/o or iscsi dtrace scripts available? On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote:> On Fri, Joe Little wrote: > > On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote: > > >On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: > > >> Thanks for the tip. In the local case, I could send to the > > >> iSCSI-backed ZFS RAIDZ at even faster rates, with a total elapsed time > > >> of 50seconds (17 seconds better than UFS). However, I didn''t even both > > >> finishing the NFS client test, since it was taking a few seconds > > >> between multiple 27K files. So, it didn''t help NFS at all. I''m > > >> wondering if there is something on the NFS end that needs changing, > > >> no? > > > > > >Keep in mind that turning off this flag may corrupt on-disk state in the > > >event of power loss, etc. What was the delta in the local case? 17 > > >seconds better than UFS, but percentage wise how much faster than the > > >original? > > > > > > > I believe it was only about 5-10% faster. I don''t have the time > > results off hand, just some dtrace latency reports. > > > > >NFS has the property that it does an enormous amount of synchronous > > >activity, which can tickle interesting pathologies. But it''s strange > > >that it didn''t help NFS that much. > > > > Should I also mount via async.. would this be honored on the Solaris > > end? The other option mentioned with similar caveats was nocto. I just > > tried with both, and the observed transfer rate was about 1.4k/s. It > > was painful deleting the 3G directory via NFS, with about 100k/s > > deletion rate on these 1000 files. Of course, When I went locally the > > delete was instantaneous. > > I wouldn''t change any of the options at the client. The issue > is at the server side and none of the other combinations that you > originally pointed out have this problem, right? Mount options at the > client will just muddy the waters. > > We need to understand if/what the NFS/ZFS/iscsi interaction is and why > it is so much worse. As Eric mentioned, there may be some interesting > pathologies at play here and we need to understand what they are so > they can be addressed. > > My suggestion is additional dtrace data collection but I don''t have > a specific suggestion as to how/what to track next. > Because of the significant additional latency, I would be looking for > big increases in the number of I/Os being generated to the iscsi backend > as compared to the local attached case. I would also look for > some type of serialization of I/Os that is occurring with iscsi vs. > the local attach. > > Spencer >
Lisa Week
2006-May-06  01:43 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
These may help:
http://opensolaris.org/os/community/dtrace/scripts/
    Check out iosnoop.d
http://www.solarisinternals.com/si/dtrace/index.php
    Check out iotrace.d
- Lisa
Joe Little wrote On 05/05/06 18:59,:
> Are there known i/o or iscsi dtrace scripts available?
>
> On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote:
>
>> On Fri, Joe Little wrote:
>> > On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote:
>> > >On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote:
>> > >> Thanks for the tip. In the local case, I could send to
the
>> > >> iSCSI-backed ZFS RAIDZ at even faster rates, with a total
>> elapsed time
>> > >> of 50seconds (17 seconds better than UFS). However, I
didn''t
>> even both
>> > >> finishing the NFS client test, since it was taking a few
seconds
>> > >> between multiple 27K files. So, it didn''t help
NFS at all. I''m
>> > >> wondering if there is something on the NFS end that needs
changing,
>> > >> no?
>> > >
>> > >Keep in mind that turning off this flag may corrupt on-disk
state
>> in the
>> > >event of power loss, etc.  What was the delta in the local
case?  17
>> > >seconds better than UFS, but percentage wise how much faster
than the
>> > >original?
>> > >
>> >
>> > I believe it was only about 5-10% faster. I don''t have
the time
>> > results off hand, just some dtrace latency reports.
>> >
>> > >NFS has the property that it does an enormous amount of
synchronous
>> > >activity, which can tickle interesting pathologies.  But
it''s strange
>> > >that it didn''t help NFS that much.
>> >
>> > Should I also mount via async.. would this be honored on the
Solaris
>> > end? The other option mentioned with similar caveats was nocto. I
just
>> > tried with both, and the observed transfer rate was about 1.4k/s.
It
>> > was painful deleting the 3G directory via NFS, with about 100k/s
>> > deletion rate on these 1000 files. Of course, When I went locally
the
>> > delete was instantaneous.
>>
>> I wouldn''t change any of the options at the client.  The issue
>> is at the server side and none of the other combinations that you
>> originally pointed out have this problem, right?  Mount options at the
>> client will just muddy the waters.
>>
>> We need to understand if/what the NFS/ZFS/iscsi interaction is and why
>> it is so much worse.  As Eric mentioned, there may be some interesting
>> pathologies at play here and we need to understand what they are so
>> they can be addressed.
>>
>> My suggestion is additional dtrace data collection but I don''t
have
>> a specific suggestion as to how/what to track next.
>> Because of the significant additional latency, I would be looking for
>> big increases in the number of I/Os being generated to the iscsi
backend
>> as compared to the local attached case.  I would also look for
>> some type of serialization of I/Os that is occurring with iscsi vs.
>> the local attach.
>>
>> Spencer
>>
> _______________________________________________
> nfs-discuss mailing list
> nfs-discuss at opensolaris.org
Joe Little
2006-May-06  04:20 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Thanks. I''m playing with it now, trying to get the most succinct test. This is one thing that bothers me: Regardless of the backend, it appears that a delete of a large tree (say the linux kernel) over NFS takes forever, but its immediate when doing so locally. Is delete over NFS really take such a different code path? On 5/5/06, Lisa Week <Lisa.Week at sun.com> wrote:> These may help: > > http://opensolaris.org/os/community/dtrace/scripts/ > Check out iosnoop.d > > http://www.solarisinternals.com/si/dtrace/index.php > Check out iotrace.d > > - Lisa > > Joe Little wrote On 05/05/06 18:59,: > > > Are there known i/o or iscsi dtrace scripts available? > > > > On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote: > > > >> On Fri, Joe Little wrote: > >> > On 5/5/06, Eric Schrock <eric.schrock at sun.com> wrote: > >> > >On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little wrote: > >> > >> Thanks for the tip. In the local case, I could send to the > >> > >> iSCSI-backed ZFS RAIDZ at even faster rates, with a total > >> elapsed time > >> > >> of 50seconds (17 seconds better than UFS). However, I didn''t > >> even both > >> > >> finishing the NFS client test, since it was taking a few seconds > >> > >> between multiple 27K files. So, it didn''t help NFS at all. I''m > >> > >> wondering if there is something on the NFS end that needs changing, > >> > >> no? > >> > > > >> > >Keep in mind that turning off this flag may corrupt on-disk state > >> in the > >> > >event of power loss, etc. What was the delta in the local case? 17 > >> > >seconds better than UFS, but percentage wise how much faster than the > >> > >original? > >> > > > >> > > >> > I believe it was only about 5-10% faster. I don''t have the time > >> > results off hand, just some dtrace latency reports. > >> > > >> > >NFS has the property that it does an enormous amount of synchronous > >> > >activity, which can tickle interesting pathologies. But it''s strange > >> > >that it didn''t help NFS that much. > >> > > >> > Should I also mount via async.. would this be honored on the Solaris > >> > end? The other option mentioned with similar caveats was nocto. I just > >> > tried with both, and the observed transfer rate was about 1.4k/s. It > >> > was painful deleting the 3G directory via NFS, with about 100k/s > >> > deletion rate on these 1000 files. Of course, When I went locally the > >> > delete was instantaneous. > >> > >> I wouldn''t change any of the options at the client. The issue > >> is at the server side and none of the other combinations that you > >> originally pointed out have this problem, right? Mount options at the > >> client will just muddy the waters. > >> > >> We need to understand if/what the NFS/ZFS/iscsi interaction is and why > >> it is so much worse. As Eric mentioned, there may be some interesting > >> pathologies at play here and we need to understand what they are so > >> they can be addressed. > >> > >> My suggestion is additional dtrace data collection but I don''t have > >> a specific suggestion as to how/what to track next. > >> Because of the significant additional latency, I would be looking for > >> big increases in the number of I/Os being generated to the iscsi backend > >> as compared to the local attached case. I would also look for > >> some type of serialization of I/Os that is occurring with iscsi vs. > >> the local attach. > >> > >> Spencer > >> > > _______________________________________________ > > nfs-discuss mailing list > > nfs-discuss at opensolaris.org > >
Joe Little
2006-May-06  04:48 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Here''s some sample output. Where the I write over NFS to ZFS (no
iscsi) I get high sizes for i/o:
  UID   PID D    BLOCK   SIZE       COMM PATHNAME
    1   427 W 22416320   4096       nfsd <none>
    1   427 W 22416328   4096       nfsd <none>
    1   427 W 22416336   4096       nfsd <none>
    1   427 W 22416344   4096       nfsd <none>
    1   427 W 22474784  16384       nfsd <none>
    1   427 W 22474816  16384       nfsd <none>
    1   427 W 22474848  16384       nfsd <none>
    1   427 W 22416352   4096       nfsd <none>
    1   427 W 22416360   4096       nfsd <none>
    1   427 W 22416368   4096       nfsd <none>
    1   427 W 22416376   4096       nfsd <none>
    1   427 W 22416384   4096       nfsd <none>
    1   427 W 22416392   4096       nfsd <none>
    1   427 W 22416400   4096       nfsd <none>
    1   427 W 22416408   4096       nfsd <none>
    1   427 W 22416416   4096       nfsd <none>
    1   427 W 22416424   4096       nfsd <none>
    1   427 W 22416432   4096       nfsd <none>
    1   427 W 22416440   4096       nfsd <none>
    1   427 W 22416448   4096       nfsd <none>
    1   427 W 22416456   4096       nfsd <none>
    1   427 W 22039040   8192       nfsd <none>
    1   427 W 22039056   8192       nfsd <none>
    1   427 W 22039072   8192       nfsd <none>
    1   427 W 22416464   4096       nfsd <none>
    1   427 W 22416472   4096       nfsd <none>
    1   427 W 22416480   4096       nfsd <none>
    1   427 W 22416488   4096       nfsd <none>
    1   427 W 22416496   4096       nfsd <none>
    1   427 W 22416504   4096       nfsd <none>
    1   427 W 22416512   4096       nfsd <none>
    1   427 W 22416520   4096       nfsd <none>
    1   427 W 22416528   4096       nfsd <none>
    1   427 W 22416536   4096       nfsd <none>
    1   427 W 22416544   4096       nfsd <none>
    1   427 W 22416552   4096       nfsd <none>
    1   427 W 22416560   4096       nfsd <none>
    1   427 W 22416568   4096       nfsd <none>
    1   427 W 22416576   4096       nfsd <none>
    1   427 W 22416584   4096       nfsd <none>
    1   427 W 22416592   4096       nfsd <none>
    1   427 W 22416600   4096       nfsd <none>
    1   427 W 22416608   4096       nfsd <none>
    1   427 W 22416616   4096       nfsd <none>
    1   427 W 22416624   4096       nfsd <none>
    1   427 W 22416632   4096       nfsd <none>
    1   427 W 22416640  12288       nfsd <none>
    1   427 W 22416664  12288       nfsd <none>
    1   427 W 22416688  12288       nfsd <none>
    1   427 W 22416712   4096       nfsd <none>
    1   427 W 22416720   4096       nfsd <none>
    1   427 W 22416728   4096       nfsd <none>
    1   427 W 22416736   4096       nfsd <none>
    1   427 W 22416744   4096       nfsd <none>
    1   427 W 22416752   4096       nfsd <none>
    1   427 W 22416760  36864       nfsd <none>
    0     0 W 22416832  53248      sched <none>
    1   427 W 22416936  53248       nfsd <none>
    1   427 W 22417040  53248       nfsd <none>
    1   427 W 22417144   4096       nfsd <none>
    1   427 W 22417152   4096       nfsd <none>
    1   427 W 22417160   4096       nfsd <none>
    1   427 W 22417168   4096       nfsd <none>
    1   427 W 22417176   4096       nfsd <none>
    1   427 W 22417184   4096       nfsd <none>
    1   427 W 22417192  12288       nfsd <none>
    1   427 W 22417216  12288       nfsd <none>
    1   427 W 22417240  12288       nfsd <none>
    1   427 W 22417264   4096       nfsd <none>
    1   427 W 22417272   4096       nfsd <none>
    1   427 W 22417280   4096       nfsd <none>
In the iscsi-backed case, I get:
   1   427 W 369124510    512       nfsd <none>
    1   427 W 369124510    512       nfsd <none>
    1   427 W 369124510    512       nfsd <none>
    1   427 W 369124510    512       nfsd <none>
    1   427 W 369124510   1024       nfsd <none>
    1   427 W 369124564   1024       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564    512       nfsd <none>
    1   427 W 369124564   1024       nfsd <none>
    1   427 W 369124565   1024       nfsd <none>
    1   427 W 369124566    512       nfsd <none>
    1   427 W 369124566    512       nfsd <none>
    1   427 W 369124565    512       nfsd <none>
    1   427 W 369124565    512       nfsd <none>
    1   427 W 369124565    512       nfsd <none>
    1   427 W 369124565    512       nfsd <none>
    1   427 W 369124565   1024       nfsd <none>
    1   427 W 369124566   1024       nfsd <none>
    1   427 W 369124567    512       nfsd <none>
    1   427 W 369124567    512       nfsd <none>
    1   427 W 369124567    512       nfsd <none>
    1   427 W 369124567    512       nfsd <none>
    1   427 W 369124566    512       nfsd <none>
    1   427 W 369124566    512       nfsd <none>
    1   427 W 369124566   1024       nfsd <none>
    1   427 W 369124567   1024       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124568    512       nfsd <none>
    1   427 W 369124567   1024       nfsd <none>
    1   427 W 369124569   1024       nfsd <none>
    1   427 W 369124569    512       nfsd <none>
Looks to me the bulk of my problem is poor block size scheduling. Is
this tuneable for either ZFS or NFS and/or can be set?
On 5/5/06, Lisa Week <Lisa.Week at sun.com> wrote:> These may help:
>
> http://opensolaris.org/os/community/dtrace/scripts/
>     Check out iosnoop.d
>
> http://www.solarisinternals.com/si/dtrace/index.php
>     Check out iotrace.d
>
> - Lisa
>
> Joe Little wrote On 05/05/06 18:59,:
>
> > Are there known i/o or iscsi dtrace scripts available?
> >
> > On 5/5/06, Spencer Shepler <spencer.shepler at sun.com> wrote:
> >
> >> On Fri, Joe Little wrote:
> >> > On 5/5/06, Eric Schrock <eric.schrock at sun.com>
wrote:
> >> > >On Fri, May 05, 2006 at 03:46:08PM -0700, Joe Little
wrote:
> >> > >> Thanks for the tip. In the local case, I could send
to the
> >> > >> iSCSI-backed ZFS RAIDZ at even faster rates, with a
total
> >> elapsed time
> >> > >> of 50seconds (17 seconds better than UFS). However,
I didn''t
> >> even both
> >> > >> finishing the NFS client test, since it was taking a
few seconds
> >> > >> between multiple 27K files. So, it didn''t
help NFS at all. I''m
> >> > >> wondering if there is something on the NFS end that
needs changing,
> >> > >> no?
> >> > >
> >> > >Keep in mind that turning off this flag may corrupt
on-disk state
> >> in the
> >> > >event of power loss, etc.  What was the delta in the
local case?  17
> >> > >seconds better than UFS, but percentage wise how much
faster than the
> >> > >original?
> >> > >
> >> >
> >> > I believe it was only about 5-10% faster. I don''t
have the time
> >> > results off hand, just some dtrace latency reports.
> >> >
> >> > >NFS has the property that it does an enormous amount of
synchronous
> >> > >activity, which can tickle interesting pathologies.  But
it''s strange
> >> > >that it didn''t help NFS that much.
> >> >
> >> > Should I also mount via async.. would this be honored on the
Solaris
> >> > end? The other option mentioned with similar caveats was
nocto. I just
> >> > tried with both, and the observed transfer rate was about
1.4k/s. It
> >> > was painful deleting the 3G directory via NFS, with about
100k/s
> >> > deletion rate on these 1000 files. Of course, When I went
locally the
> >> > delete was instantaneous.
> >>
> >> I wouldn''t change any of the options at the client.  The
issue
> >> is at the server side and none of the other combinations that you
> >> originally pointed out have this problem, right?  Mount options at
the
> >> client will just muddy the waters.
> >>
> >> We need to understand if/what the NFS/ZFS/iscsi interaction is and
why
> >> it is so much worse.  As Eric mentioned, there may be some
interesting
> >> pathologies at play here and we need to understand what they are
so
> >> they can be addressed.
> >>
> >> My suggestion is additional dtrace data collection but I
don''t have
> >> a specific suggestion as to how/what to track next.
> >> Because of the significant additional latency, I would be looking
for
> >> big increases in the number of I/Os being generated to the iscsi
backend
> >> as compared to the local attached case.  I would also look for
> >> some type of serialization of I/Os that is occurring with iscsi
vs.
> >> the local attach.
> >>
> >> Spencer
> >>
> > _______________________________________________
> > nfs-discuss mailing list
> > nfs-discuss at opensolaris.org
>
>
Spencer Shepler
2006-May-06  04:55 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, Joe Little wrote:> Thanks. I''m playing with it now, trying to get the most succinct test. > This is one thing that bothers me: Regardless of the backend, it > appears that a delete of a large tree (say the linux kernel) over NFS > takes forever, but its immediate when doing so locally. Is delete over > NFS really take such a different code path?Yes. As mentioned in my other email, the NFS protocol requires that operations like REMOVE, RMDIR, CREATE have the filesystem metadata written to stable storage/disk before sending a response to the client. That is not required of local access and therefore the disparity between the two. Spencer
Nicolas Williams
2006-May-07  04:53 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, May 05, 2006 at 09:48:00PM -0700, Joe Little wrote:> Here''s some sample output. Where the I write over NFS to ZFS (no > iscsi) I get high sizes for i/o: > > In the iscsi-backed case, I get:[small block sizes]> Looks to me the bulk of my problem is poor block size scheduling. Is > this tuneable for either ZFS or NFS and/or can be set?Looking at iscsiadm(1M) (and knowing little about iSCSI) I wonder what the maxburstlength/maxrecvdataseglen parameters are set to... Nico --
Joe Little
2006-May-07  06:41 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
I would normally agree, just that I get very large block sizes in the
non-NFS scenario.
However, for completeness, you''ll see my standard settings allow for
pretty large transfers. The maxreceivedataseglen below is the default,
and increasing it may help, but it still doesn''t answer why the local
non-nfs case has no problems, right?
                        First Burst Length: 65536
                        Immediate Data: yes
                        Initial Ready To Transfer (R2T): yes
                        Max Burst Length: 262144
                        Max Outstanding R2T: 1
                        Max Receive Data Segment Length: 8192
                        Max Connections: 1
                        Header Digest: NONE
                        Data Digest: NONE
On 5/6/06, Nicolas Williams <Nicolas.Williams at sun.com>
wrote:> On Fri, May 05, 2006 at 09:48:00PM -0700, Joe Little wrote:
> > Here''s some sample output. Where the I write over NFS to ZFS
(no
> > iscsi) I get high sizes for i/o:
> >
> > In the iscsi-backed case, I get:
>
> [small block sizes]
>
> > Looks to me the bulk of my problem is poor block size scheduling. Is
> > this tuneable for either ZFS or NFS and/or can be set?
>
> Looking at iscsiadm(1M) (and knowing little about iSCSI) I wonder what
> the maxburstlength/maxrecvdataseglen parameters are set to...
>
> Nico
> --
>
Nicolas Williams
2006-May-08  18:02 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote:> On Fri, Joe Little wrote: > > Thanks. I''m playing with it now, trying to get the most succinct test. > > This is one thing that bothers me: Regardless of the backend, it > > appears that a delete of a large tree (say the linux kernel) over NFS > > takes forever, but its immediate when doing so locally. Is delete over > > NFS really take such a different code path? > > Yes. As mentioned in my other email, the NFS protocol requires > that operations like REMOVE, RMDIR, CREATE have the filesystem > metadata written to stable storage/disk before sending a response > to the client. That is not required of local access and therefore > the disparity between the two.So then multi-threading rm/rmdir on the client-side would help, no? Are there/should there be async versions of creat(2)/mkdir(2)/ rmdir(2)/link(2)/unlink(2)/...? Nico --
Joe Little
2006-May-09  04:24 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
I was asked to also snoop the iscsi end of things, trying to find something different between the two. iscsi being relatively opaque, it was easiest to find differences in the patterns. In the local copy to RAIDZ example, the iscsi link would show packets of 1514 in length in series of 5-10, with interjecting packets of 60 or 102, generally 2-4 in number. In the NFS client hitting the RAIDZ/iscsi combo, the iscsi length would have 3-5 on average 1514 length packets with 5-7 packets of 60 or 102 in between. Basically, the averages swapped, and its likely because of a lot more meta data and/or write confirmations going on in the NFS case. At this point in time, I have two very important questions: 1) Is there any options available or planned to make NFS/ZFS work more in concert to avoid this overhead, which with many small iscsi packets (in the iscsi case) kills performance? 2) Is iscsi-backed storage, especially StorageTek acquired products, in the planning matrix for supported ZFS (NAS) solutions? Also, why hasn''t this combination been tested to date since this appears to be an achilles heal. Again, UFS does not have this problem, nor other file systems on other OSes (namely, XFS, JFS, etc which I''ve tested before) On 5/8/06, Nicolas Williams <Nicolas.Williams at sun.com> wrote:> On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote: > > On Fri, Joe Little wrote: > > > Thanks. I''m playing with it now, trying to get the most succinct test. > > > This is one thing that bothers me: Regardless of the backend, it > > > appears that a delete of a large tree (say the linux kernel) over NFS > > > takes forever, but its immediate when doing so locally. Is delete over > > > NFS really take such a different code path? > > > > Yes. As mentioned in my other email, the NFS protocol requires > > that operations like REMOVE, RMDIR, CREATE have the filesystem > > metadata written to stable storage/disk before sending a response > > to the client. That is not required of local access and therefore > > the disparity between the two. > > So then multi-threading rm/rmdir on the client-side would help, no? > > Are there/should there be async versions of creat(2)/mkdir(2)/ > rmdir(2)/link(2)/unlink(2)/...? > > Nico > -- >
David Weibel
2006-May-09  15:02 UTC
[dtrace-discuss] Re: Re: [nfs-discuss] Script to trace NFSv3 client operations
A theoretical problem could occur if both iSCSI/NFS are sharing a 10/100 path. When NFS requests a file, ZFS might request a larger amount of IO from the iSCSI target (pre-caching), this pre-cache burst could then saturate the pipe causing NFS to slow when it returns its data? The Solaris iSCSI initiator team hasn''t done much testing with iSCSI->ZFS->NFS performance. We have done a lot of performance testing with iSCSI->Raw/UFS/ZFS/Raw/etc. Is NFS/iSCSI using the same NIC path? Is it 10/100, 1G, etc? Is this with the Solaris iSCSI initiator or QLogic 4XXX? It also might be interesting if you can let us know the iSCSI storage type. Each target I have worked with has different performance characteristics. Some have verify poor latency on different transfer sizes. This message posted from opensolaris.org
Joe Little
2006-May-09  22:07 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Well, I tried some suggested iscsi tunings to no avail. I did try something else though: I brought up samba. My linux 2.2 source tree copying into the ZFS volume (in other words, SMB->ZFS->iSCSI) did far much better, taking a minute to copy 102MB. And that''s from a 100MB/sec client. My original tests of around 2TB of mixed files (mostly) small, took 51 minutes in the NFS case, but took 3.5 minutes exactly in the SMB case (again, from a 100MB/sec client). I averaged almost 9MB/sec on that transfer. These are not stellar numbers, but they are far better than the low k/sec that I see with NFS. I definitely think the bug is on the NFS server end, even considering that the SMB protocol is different. On 5/8/06, Joe Little <jmlittle at gmail.com> wrote:> > I was asked to also snoop the iscsi end of things, trying to find > something different between the two. iscsi being relatively opaque, it > was easiest to find differences in the patterns. In the local copy to > RAIDZ example, the iscsi link would show packets of 1514 in length in > series of 5-10, with interjecting packets of 60 or 102, generally 2-4 > in number. In the NFS client hitting the RAIDZ/iscsi combo, the iscsi > length would have 3-5 on average 1514 length packets with 5-7 packets > of 60 or 102 in between. Basically, the averages swapped, and its > likely because of a lot more meta data and/or write confirmations > going on in the NFS case. > > At this point in time, I have two very important questions: > > 1) Is there any options available or planned to make NFS/ZFS work more > in concert to avoid this overhead, which with many small iscsi packets > (in the iscsi case) kills performance? > > 2) Is iscsi-backed storage, especially StorageTek acquired products, > in the planning matrix for supported ZFS (NAS) solutions? Also, why > hasn''t this combination been tested to date since this appears to be > an achilles heal. Again, UFS does not have this problem, nor other > file systems on other OSes (namely, XFS, JFS, etc which I''ve tested > before) > > > On 5/8/06, Nicolas Williams <Nicolas.Williams at sun.com> wrote: > > On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote: > > > On Fri, Joe Little wrote: > > > > Thanks. I''m playing with it now, trying to get the most succinct > test. > > > > This is one thing that bothers me: Regardless of the backend, it > > > > appears that a delete of a large tree (say the linux kernel) over > NFS > > > > takes forever, but its immediate when doing so locally. Is delete > over > > > > NFS really take such a different code path? > > > > > > Yes. As mentioned in my other email, the NFS protocol requires > > > that operations like REMOVE, RMDIR, CREATE have the filesystem > > > metadata written to stable storage/disk before sending a response > > > to the client. That is not required of local access and therefore > > > the disparity between the two. > > > > So then multi-threading rm/rmdir on the client-side would help, no? > > > > Are there/should there be async versions of creat(2)/mkdir(2)/ > > rmdir(2)/link(2)/unlink(2)/...? > > > > Nico > > -- > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060509/b40ad1e3/attachment.html>
David Weibel
2006-May-10  22:29 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
I have a test configuration up and running internally.  I''m
not sure I''m seeing the exact same issues.  For this testing
I''m using GRITS.  I don''t know any really great FS performance
tools.  I tend to do my performance testing with vdbench with
raw SCSI IO on purpose to avoid FS caches.  (I''m more focused
on iSCSI performance in the Solaris initiator.)
configuration:
    ultra 20(amd64) - gige connection via nge
    solaris nevada build 40 - nondebug
    Cisco48portGigESwitch
      +-> ultra20 (amd64) host
      +-> w2100z (2x amd64) host
      +-> equal logic iscsi array (3 x 20G LUNs mapped to ultra20)
      |
      +-> lots of other unrelated hosts, arrays, etc...
    3 x 20G iSCSI LUNs -> RAW IO <==> vdbench
       pref: ~100 MB/s arg.
    3 x 20G iSCSI LUNs -> UFS <==> localhost GRITS
       perf: ~76.9 MB/s avg.
    3 x 20G iSCSI LUNs -> NFS <==> w2100z(2x amd64) GRITS
       pref: ~34 MB/s avg.
    3 x 20G iSCSI LUNs -> ZFS raidz <==> localhost GRITS
       pref: ~90 MB/s avg. (more bursty than ufs
                              29MB/s to 36MB/s)
    3 x 20G iSCSI LUNs -> ZFS raidz -> NFS <==>
                                       w2100(2x amd64) GRITS
       pref: ~15 MB/s avg.
GRITS creates 10 directories, write 10 files in each directory,
reads back the files, verifies their contents, deletes the files
and directories and then repeats the process.  This tool is java
based and designed more for file system data verification testing
not performance.
VDBENCH used for performance testing and benchmarking.
I will hand off my configuration for the Sun NFS teams internally
to check out.
-David
Joe Little wrote:
> Well, I tried some suggested iscsi tunings to no avail. I did try 
> something else though: I brought up samba. My linux 2.2 source tree 
> copying into the ZFS volume (in other words, SMB->ZFS->iSCSI) did far
> much better, taking a minute to copy 102MB. And that''s from a 
> 100MB/sec client.
>
> My original tests of around 2TB of mixed files (mostly) small, took 51 
> minutes in the NFS case, but took 3.5 minutes exactly in the SMB case 
> (again, from a 100MB/sec client). I averaged almost 9MB/sec on that 
> transfer. These are not stellar numbers, but they are far better than 
> the low k/sec that I see with NFS. I definitely think the bug is on 
> the NFS server end, even considering that the SMB protocol is different.
>
>
>
>
> On 5/8/06, *Joe Little* <jmlittle at gmail.com 
> <mailto:jmlittle at gmail.com>> wrote:
>
>     I was asked to also snoop the iscsi end of things, trying to find
>     something different between the two. iscsi being relatively opaque, it
>     was easiest to find differences in the patterns. In the local copy to
>     RAIDZ example, the iscsi link would show packets of 1514 in length in
>     series of 5-10, with interjecting packets of 60 or 102, generally 2-4
>     in number. In the NFS client hitting the RAIDZ/iscsi combo, the iscsi
>     length would have 3-5 on average 1514 length packets with 5-7 packets
>     of 60 or 102 in between. Basically, the averages swapped, and its
>     likely because of a lot more meta data and/or write confirmations
>     going on in the NFS case.
>
>     At this point in time, I have two very important questions:
>
>     1) Is there any options available or planned to make NFS/ZFS work more
>     in concert to avoid this overhead, which with many small iscsi packets
>     (in the iscsi case) kills performance?
>
>     2) Is iscsi-backed storage, especially StorageTek acquired products,
>     in the planning matrix for supported ZFS (NAS) solutions? Also, why
>     hasn''t this combination been tested to date since this appears
to be
>     an achilles heal. Again, UFS does not have this problem, nor other
>     file systems on other OSes (namely, XFS, JFS, etc which I''ve
tested
>     before)
>
>
>     On 5/8/06, Nicolas Williams <Nicolas.Williams at sun.com
>     <mailto:Nicolas.Williams at sun.com>> wrote:
>     > On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote:
>     > > On Fri, Joe Little wrote:
>     > > > Thanks. I''m playing with it now, trying to get
the most
>     succinct test.
>     > > > This is one thing that bothers me: Regardless of the
backend, it
>     > > > appears that a delete of a large tree (say the linux
kernel)
>     over NFS
>     > > > takes forever, but its immediate when doing so locally.
Is
>     delete over
>     > > > NFS really take such a different code path?
>     > >
>     > > Yes.  As mentioned in my other email, the NFS protocol
requires
>     > > that operations like REMOVE, RMDIR, CREATE have the
filesystem
>     > > metadata written to stable storage/disk before sending a
response
>     > > to the client.  That is not required of local access and
>     therefore
>     > > the disparity between the two.
>     >
>     > So then multi-threading rm/rmdir on the client-side would help,
no?
>     >
>     > Are there/should there be async versions of creat(2)/mkdir(2)/
>     > rmdir(2)/link(2)/unlink(2)/...?
>     >
>     > Nico
>     > --
>     >
>
>
Joe Little
2006-May-10  22:57 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Thanks for setting this up. To prove me wrong, could someone grab any linux-2.2.*.tar.gz and untar into the array over NFS and measure the time. You have equallogic targets, and those are probably the most performant. I''m using SBEi targets, which go up to ERL2 and tend to also be performant (80-120MB/sec) for me. as for GRITS, I''m not finding it using a simple google search. Is that generally available, and for free? On 5/10/06, David Weibel <David.Weibel at sun.com> wrote:> I have a test configuration up and running internally. I''m > not sure I''m seeing the exact same issues. For this testing > I''m using GRITS. I don''t know any really great FS performance > tools. I tend to do my performance testing with vdbench with > raw SCSI IO on purpose to avoid FS caches. (I''m more focused > on iSCSI performance in the Solaris initiator.) > > configuration: > ultra 20(amd64) - gige connection via nge > solaris nevada build 40 - nondebug > > Cisco48portGigESwitch > +-> ultra20 (amd64) host > +-> w2100z (2x amd64) host > +-> equal logic iscsi array (3 x 20G LUNs mapped to ultra20) > | > +-> lots of other unrelated hosts, arrays, etc... > > 3 x 20G iSCSI LUNs -> RAW IO <==> vdbench > pref: ~100 MB/s arg. > > 3 x 20G iSCSI LUNs -> UFS <==> localhost GRITS > perf: ~76.9 MB/s avg. > > 3 x 20G iSCSI LUNs -> NFS <==> w2100z(2x amd64) GRITS > pref: ~34 MB/s avg. > > 3 x 20G iSCSI LUNs -> ZFS raidz <==> localhost GRITS > pref: ~90 MB/s avg. (more bursty than ufs > 29MB/s to 36MB/s) > > 3 x 20G iSCSI LUNs -> ZFS raidz -> NFS <==> > w2100(2x amd64) GRITS > pref: ~15 MB/s avg. > > GRITS creates 10 directories, write 10 files in each directory, > reads back the files, verifies their contents, deletes the files > and directories and then repeats the process. This tool is java > based and designed more for file system data verification testing > not performance. > > VDBENCH used for performance testing and benchmarking. > > I will hand off my configuration for the Sun NFS teams internally > to check out. > > -David > > Joe Little wrote: > > > Well, I tried some suggested iscsi tunings to no avail. I did try > > something else though: I brought up samba. My linux 2.2 source tree > > copying into the ZFS volume (in other words, SMB->ZFS->iSCSI) did far > > much better, taking a minute to copy 102MB. And that''s from a > > 100MB/sec client. > > > > My original tests of around 2TB of mixed files (mostly) small, took 51 > > minutes in the NFS case, but took 3.5 minutes exactly in the SMB case > > (again, from a 100MB/sec client). I averaged almost 9MB/sec on that > > transfer. These are not stellar numbers, but they are far better than > > the low k/sec that I see with NFS. I definitely think the bug is on > > the NFS server end, even considering that the SMB protocol is different. > > > > > > > > > > On 5/8/06, *Joe Little* <jmlittle at gmail.com > > <mailto:jmlittle at gmail.com>> wrote: > > > > I was asked to also snoop the iscsi end of things, trying to find > > something different between the two. iscsi being relatively opaque, it > > was easiest to find differences in the patterns. In the local copy to > > RAIDZ example, the iscsi link would show packets of 1514 in length in > > series of 5-10, with interjecting packets of 60 or 102, generally 2-4 > > in number. In the NFS client hitting the RAIDZ/iscsi combo, the iscsi > > length would have 3-5 on average 1514 length packets with 5-7 packets > > of 60 or 102 in between. Basically, the averages swapped, and its > > likely because of a lot more meta data and/or write confirmations > > going on in the NFS case. > > > > At this point in time, I have two very important questions: > > > > 1) Is there any options available or planned to make NFS/ZFS work more > > in concert to avoid this overhead, which with many small iscsi packets > > (in the iscsi case) kills performance? > > > > 2) Is iscsi-backed storage, especially StorageTek acquired products, > > in the planning matrix for supported ZFS (NAS) solutions? Also, why > > hasn''t this combination been tested to date since this appears to be > > an achilles heal. Again, UFS does not have this problem, nor other > > file systems on other OSes (namely, XFS, JFS, etc which I''ve tested > > before) > > > > > > On 5/8/06, Nicolas Williams <Nicolas.Williams at sun.com > > <mailto:Nicolas.Williams at sun.com>> wrote: > > > On Fri, May 05, 2006 at 11:55:17PM -0500, Spencer Shepler wrote: > > > > On Fri, Joe Little wrote: > > > > > Thanks. I''m playing with it now, trying to get the most > > succinct test. > > > > > This is one thing that bothers me: Regardless of the backend, it > > > > > appears that a delete of a large tree (say the linux kernel) > > over NFS > > > > > takes forever, but its immediate when doing so locally. Is > > delete over > > > > > NFS really take such a different code path? > > > > > > > > Yes. As mentioned in my other email, the NFS protocol requires > > > > that operations like REMOVE, RMDIR, CREATE have the filesystem > > > > metadata written to stable storage/disk before sending a response > > > > to the client. That is not required of local access and > > therefore > > > > the disparity between the two. > > > > > > So then multi-threading rm/rmdir on the client-side would help, no? > > > > > > Are there/should there be async versions of creat(2)/mkdir(2)/ > > > rmdir(2)/link(2)/unlink(2)/...? > > > > > > Nico > > > -- > > > > > > > > > >
Roch Bourbonnais - Performance Engineering
2006-May-11  08:32 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
# ptime tar xf linux-2.2.22.tar ptime tar xf linux-2.2.22.tar real 50.292 user 1.019 sys 11.417 # ptime tar xf linux-2.2.22.tar ptime tar xf linux-2.2.22.tar real 56.833 user 1.056 sys 11.581 # avg time waiting for async writes is around 3ms. How much are you getting for the tar xf ? -r
Joe Little
2006-May-11  10:56 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
How did you get the average time for async writes? My client (lacking ptime, its linux) comes in at 50 minutes, not 50 seconds. I''m running again right now for a more accurate number. I''m untarring from a local file on the directory to the NFS share. On 5/11/06, Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at sun.com> wrote:> > > # ptime tar xf linux-2.2.22.tar > ptime tar xf linux-2.2.22.tar > > real 50.292 > user 1.019 > sys 11.417 > # ptime tar xf linux-2.2.22.tar > ptime tar xf linux-2.2.22.tar > > real 56.833 > user 1.056 > sys 11.581 > # > > > avg time waiting for async writes is around 3ms. > How much are you getting for the tar xf ? > > -r > >
Roch Bourbonnais - Performance Engineering
2006-May-11  11:06 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
Joe Little writes: > How did you get the average time for async writes? My client (lacking > ptime, its linux) comes in at 50 minutes, not 50 seconds. I''m running > again right now for a more accurate number. I''m untarring from a local > file on the directory to the NFS share. > I used dtrace to measure times (I used the sleep time so it gives a ballpark figure). I untared with the tar file on the NFS share. Just retimed after moving the tar file to /tmp. # ptime tar xf /tmp/linux-2.2.22.tar ptime tar xf /tmp/linux-2.2.22.tar real 49.630 user 1.033 sys 11.405 -r > > On 5/11/06, Roch Bourbonnais - Performance Engineering > <Roch.Bourbonnais at sun.com> wrote: > > > > > > # ptime tar xf linux-2.2.22.tar > > ptime tar xf linux-2.2.22.tar > > > > real 50.292 > > user 1.019 > > sys 11.417 > > # ptime tar xf linux-2.2.22.tar > > ptime tar xf linux-2.2.22.tar > > > > real 56.833 > > user 1.056 > > sys 11.581 > > # > > > > > > avg time waiting for async writes is around 3ms. > > How much are you getting for the tar xf ? > > > > -r > > > >
Joe Little
2006-May-11  13:59 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
well, here''s my first pass result: [root at csl-prime loges1]# time tar xf /root/linux-2.2.26.tar real 114m6.662s user 0m0.049s sys 0m1.354s On 5/11/06, Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at sun.com> wrote:> > Joe Little writes: > > How did you get the average time for async writes? My client (lacking > > ptime, its linux) comes in at 50 minutes, not 50 seconds. I''m running > > again right now for a more accurate number. I''m untarring from a local > > file on the directory to the NFS share. > > > > I used dtrace to measure times (I used the sleep time so it > gives a ballpark figure). > > I untared with the tar file on the NFS share. > Just retimed after moving the tar file to /tmp. > > # ptime tar xf /tmp/linux-2.2.22.tar > ptime tar xf /tmp/linux-2.2.22.tar > > real 49.630 > user 1.033 > sys 11.405 > > > -r > > > > > On 5/11/06, Roch Bourbonnais - Performance Engineering > > <Roch.Bourbonnais at sun.com> wrote: > > > > > > > > > # ptime tar xf linux-2.2.22.tar > > > ptime tar xf linux-2.2.22.tar > > > > > > real 50.292 > > > user 1.019 > > > sys 11.417 > > > # ptime tar xf linux-2.2.22.tar > > > ptime tar xf linux-2.2.22.tar > > > > > > real 56.833 > > > user 1.056 > > > sys 11.581 > > > # > > > > > > > > > avg time waiting for async writes is around 3ms. > > > How much are you getting for the tar xf ? > > > > > > -r > > > > > > > >
Spencer Shepler
2006-May-12  00:01 UTC
[zfs-discuss] Re: [dtrace-discuss] Re: [nfs-discuss] Script to trace NFSv3 client operations
So, there is a set of us that have taken this discussion off list while we collect/share more data. Once we get a little further in diagnosis, we will summarize and bring the discussion back to the alias. Spencer On Thu, Joe Little wrote:> well, here''s my first pass result: > > [root at csl-prime loges1]# time tar xf /root/linux-2.2.26.tar > > real 114m6.662s > user 0m0.049s > sys 0m1.354s > > > On 5/11/06, Roch Bourbonnais - Performance Engineering > <Roch.Bourbonnais at sun.com> wrote: > > > >Joe Little writes: > > > How did you get the average time for async writes? My client (lacking > > > ptime, its linux) comes in at 50 minutes, not 50 seconds. I''m running > > > again right now for a more accurate number. I''m untarring from a local > > > file on the directory to the NFS share. > > > > > > >I used dtrace to measure times (I used the sleep time so it > >gives a ballpark figure). > > > >I untared with the tar file on the NFS share. > >Just retimed after moving the tar file to /tmp. > > > ># ptime tar xf /tmp/linux-2.2.22.tar > >ptime tar xf /tmp/linux-2.2.22.tar > > > >real 49.630 > >user 1.033 > >sys 11.405 > > > > > >-r > > > > > > > > On 5/11/06, Roch Bourbonnais - Performance Engineering > > > <Roch.Bourbonnais at sun.com> wrote: > > > > > > > > > > > > # ptime tar xf linux-2.2.22.tar > > > > ptime tar xf linux-2.2.22.tar > > > > > > > > real 50.292 > > > > user 1.019 > > > > sys 11.417 > > > > # ptime tar xf linux-2.2.22.tar > > > > ptime tar xf linux-2.2.22.tar > > > > > > > > real 56.833 > > > > user 1.056 > > > > sys 11.581 > > > > # > > > > > > > > > > > > avg time waiting for async writes is around 3ms. > > > > How much are you getting for the tar xf ? > > > > > > > > -r > > > > > > > > > > > >