Matthew B Sweeney - Sun Microsystems Inc.
2006-Nov-21 15:59 UTC
[zfs-discuss] poor NFS/ZFS performance
Hi,

I have an application that uses NFS between a Thumper and a 4600. The Thumper exports two ZFS filesystems that the 4600 uses as an inqueue and an outqueue.

The machines are connected via a point-to-point 10GE link, and all NFS traffic goes over that link. The NFS performance doing a simple cp from one partition to the other is well below what I'd expect: 58 MB/s. I've tried some NFS tweaks, tweaks to the Neterion cards (soft rings etc.), and tweaks to the TCP stack on both sides, to no avail. Jumbo frames are enabled and working, which improves performance, but doesn't make it fly.

I've tested the link with iperf and have been able to sustain 5-6 Gb/s. The local ZFS filesystems (12-disk stripe, 34-disk stripe) perform very well (450-500 MB/s sustained).

My research points to disabling the ZIL. So far the only way I've found to disable the ZIL is through mdb: echo 'zil_disable/W 1' | mdb -kw. My question is: can I achieve this setting via a /kernel/drv/zfs.conf or /etc/system parameter?

Thanks
Matt

--
Matt Sweeney
Engagement Architect
Sun Microsystems
585-368-5930/x29097 desk
585-727-0573 cell
Matthew B Sweeney - Sun Microsystems Inc. writes:
> My research points to disabling the ZIL. So far the only way I've found
> to disable the ZIL is through mdb: echo 'zil_disable/W 1' | mdb -kw. My
> question is: can I achieve this setting via a /kernel/drv/zfs.conf or
> /etc/system parameter?

You may set it in /etc/system. We're thinking of renaming the variable to

    set zfs_please_corrupt_my_client's_data = 1

Just kidding (about the name), but it will corrupt your data.

-r
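For reference, a minimal sketch of the two ways this thread discusses flipping the tunable; the /etc/system form is the one a later post confirms working, and either one gives up the commit guarantee that NFS clients depend on:

    # Runtime toggle via mdb -- takes effect immediately, not persistent:
    echo 'zil_disable/W 1' | mdb -kw

    # Persistent form, in /etc/system -- takes effect after a reboot:
    set zfs:zil_disable = 1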
On 11/21/06, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
> You may set it in /etc/system. We're thinking of renaming
> the variable to
>
>     set zfs_please_corrupt_my_client's_data = 1
>
> Just kidding (about the name), but it will corrupt your data.
>
> -r

Yes, we've entered this thread multiple times before, where NFS basically sucks compared to the relative performance locally. I'm waiting, ever so eagerly, for the per-pool (or was it per-FS?) properties that give finer-grained control of the ZIL, named "sync_deferred". Where is that, by the way?

(From Neil:)
> NP> We once had plans to add a mount option to allow the admin
> NP> to control the ZIL. Here's a brief section of the RFE (6280630):
>
> NP> sync={deferred,standard,forced}
>
> NP> Controls synchronous semantics for the dataset.
>
> NP> When set to 'standard' (the default), synchronous operations
> NP> such as fsync(3C) behave precisely as defined in
> NP> fcntl.h(3HEAD).
>
> NP> When set to 'deferred', requests for synchronous semantics
> NP> are ignored. However, ZFS still guarantees that ordering
> NP> is preserved -- that is, consecutive operations reach stable
> NP> storage in order. (If a thread performs operation A followed
> NP> by operation B, then the moment that B reaches stable storage,
> NP> A is guaranteed to be on stable storage as well.) ZFS also
> NP> guarantees that all operations will be scheduled for write to
> NP> stable storage within a few seconds, so that an unexpected
> NP> power loss only takes the last few seconds of change with it.
>
> NP> When set to 'forced', all operations become synchronous.
> NP> No operation will return until all previous operations
> NP> have been committed to stable storage. This option can be
> NP> useful if an application is found to depend on synchronous
> NP> semantics without actually requesting them; otherwise, it
> NP> will just make everything slow, and is not recommended.
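If that RFE ever ships, usage would presumably follow the normal dataset-property pattern. The lines below are a hypothetical sketch of the interface the RFE proposes, not something available in current builds:

    # Hypothetical syntax, per RFE 6280630 -- not implemented yet:
    zfs set sync=deferred tank/export/inqueue
    zfs get sync tank/export/inqueue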
Matthew B Sweeney - Sun Microsystems Inc.
2006-Nov-21 17:54 UTC
[zfs-discuss] poor NFS/ZFS performance
Roch,

Am I barking up the wrong tree? Or is ZFS over NFS not the right solution?

As I understand the ZIL's functionality I may lose updates, but the filesystem would remain intact. From http://www.opensolaris.org/jive/thread.jspa?messageID=20935:

> The ZIL is not required for fsckless operation. If you turned off
> the ZIL, all it would mean is that in the event of a crash, it would
> appear that some of the most recent (last few seconds) synchronous
> system calls never happened. In other words, we wouldn't have met
> the O_DSYNC specification, but the filesystem would nevertheless
> still be perfectly consistent on disk.
>
> Jeff

This application isn't anything transactional: a file is read, processed, and a new (modified) file is written to another store. So if all I'm risking is the currently open file, I can have the application rewrite it.

I haven't had a chance to test this yet; the machines are physically somewhere else and not networked to the outside world. Should I be using UFS over NFS?

Thanks
Matt

--
Matt Sweeney
Engagement Architect
Sun Microsystems
585-368-5930/x29097 desk
585-727-0573 cell
On Nov 21, 2006, at 12:19, Joe Little wrote:
> Yes, we've entered this thread multiple times before, where NFS
> basically sucks compared to the relative performance locally. I'm
> waiting, ever so eagerly, for the per-pool (or was it per-FS?)
> properties that give finer-grained control of the ZIL, named
> "sync_deferred". Where is that, by the way?

The problem with this is that it's essentially cheating, particularly when you get a commit operation from NFS pushing an fsync().

I know that most Linux kernels do this sort of thing with NFS async by lying to the client, but with non-battery-backed memory this pretty much seems like a bad idea to me. What you really want is some sort of fast write-through, but that would require a major rewrite in several places and a rethink on some of the design philosophy.

.je
Matthew B Sweeney - Sun Microsystems Inc. writes:
> Roch,
>
> Am I barking up the wrong tree? Or is ZFS over NFS not the right solution?
>
> As I understand the ZIL's functionality I may lose updates, but the
> filesystem would remain intact

The server filesystem will remain intact. The problem is the view provided to the client in the face of a server crash/reboot. The issue is not specific to ZFS and affects any client talking to any server-side FS that ignores the client's commit request (such as Linux over a WCE device).

The issue would show up on the client as:

    # cd /mynfsmount
    # sum fileA
    31998 1 fileA
    # cp fileA fileB
    (in the middle of the cp, the server crashes/reboots; this small cp takes 5 minutes to run)
    # sum fileB
    60712 1 fileB

So the cp is successful, but the output fileB is corrupted. Nothing ZFS related -- just a server-side FS ignoring commits.

> This application isn't anything transactional: a file is read, processed,
> and a new (modified) file is written to another store. So if all I'm
> risking is the currently open file, I can have the application rewrite it.
>
> I haven't had a chance to test this yet; the machines are physically
> somewhere else and not networked to the outside world. Should I be
> using UFS over NFS?

If UFS is much faster than ZFS/latest bits (and it may be, I don't know), then we need to work on whatever is causing this, but disabling the ZIL is not the proper way (IMO).

-r
This works for me in /etc/system:

    set zfs:zil_disable=1

I had to run bootadm update-archive then reboot on x86 -- not sure if that's needed on SPARC.
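To confirm the variable actually took the new value after the reboot, it can be read back with mdb; a quick sketch, assuming the symbol name is unchanged in your build:

    # Print the current value of zil_disable; expect 1 if the setting took:
    echo 'zil_disable/D' | mdb -k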
On 11/21/06, Matthew B Sweeney - Sun Microsystems Inc. <Matthew.Sweeney at sun.com> wrote:
>
> Roch,
>
> Am I barking up the wrong tree? Or is ZFS over NFS not the right solution?
>

I strongly believe it is. We just are at odds as to some philosophy. Either we need NVRAM-backed storage between NFS and ZFS, battery-backed memory that can survive other subsystem failures, or a change in the code path to allow some discretion here. Currently, the third option is 6280630, ZIL synchronicity, or as I reference it, sync_deferred functionality.

A combination is best, but the sooner this arrives, the better for anyone who needs a general-purpose file server / NAS that compares anywhere near to the competition.
On Nov 21, 2006, at 1:36 PM, Joe Little wrote:
> I strongly believe it is. We just are at odds as to some philosophy.
> Either we need NVRAM-backed storage between NFS and ZFS, battery-backed
> memory that can survive other subsystem failures, or a change in the
> code path to allow some discretion here. Currently, the third option is
> 6280630, ZIL synchronicity, or as I reference it, sync_deferred
> functionality.
>
> A combination is best, but the sooner this arrives, the better for
> anyone who needs a general-purpose file server / NAS that compares
> anywhere near to the competition.

I had heard that some stuff in the latest OS, and coming in Sol 10 U3, should greatly help NFS/ZFS performance -- something to do with ZFS not syncing the entire pool on every sync but just the stuff needed, or something like that. I heard it kind of second- or third-hand, so I cannot be too detailed in my description. Can someone here "in the know" confirm that this is so (or not)?

Thanks
Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net
To accelerate NFS (in particular single-threaded loads) you need (somewhat badly) some *RAM between the server FS and its storage; that *RAM is where NFS-committed data may be stored. If the *RAM does not survive a server reboot, the client is at risk of seeing corruption.

For example, UFS over WCE storage will be fast and corruption prone (from the client-side point of view). ZFS over WCE storage behaves differently because it manages the write cache in a way that makes serving NFS slow but safe. zil_disable can be used to make ZFS serve NFS fast and corruption prone (from the client-side point of view).

-r

Joe Little writes:
> I strongly believe it is. We just are at odds as to some philosophy.
> Either we need NVRAM-backed storage between NFS and ZFS, battery-backed
> memory that can survive other subsystem failures, or a change in the
> code path to allow some discretion here. Currently, the third option is
> 6280630, ZIL synchronicity, or as I reference it, sync_deferred
> functionality.
On Tue, 21 Nov 2006, Joe Little wrote:
> Yes, we've entered this thread multiple times before, where NFS
> basically sucks compared to the relative performance locally. I'm
> waiting, ever so eagerly, for the per-pool (or was it per-FS?)
> properties that give finer-grained control of the ZIL, named
> "sync_deferred". Where is that, by the way?

Agreed -- it sucks, especially for small-file use. Here's a 5,000 ft view of the performance while unzipping and extracting a tar archive. First the test is run on a SPARC 280R running Build 51a with dual 900MHz USIII CPUs and 4Gb of RAM:

    $ cp emacs-21.4a.tar.gz /tmp
    $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -

    real       13.092
    user        2.083
    sys         0.183

Next, the test is run on the same box in /tmp:

    $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -

    real        2.983
    user        2.038
    sys         0.201

Next the test is run on an NFS mount of a ZFS filesystem on a 5-disk raidz device over a gigabit ethernet interface with only two hosts on the VLAN (the ZFS server is a dual-socket AMD whitebox with two dual-core 2.2GHz CPUs):

    $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -

    real     2:32.667
    user        2.410
    sys         0.233

Houston, we have a problem. What OS is the ZFS-based NFS server running? I can't say, but let's say that it's close to Update 3.

Next we move emacs-21.4a.tar.gz to the NFS server and run it in the same filesystem that is NFS-mounted to the 280R:

    $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -

    real        3.365
    user        0.880
    sys         0.154

No problem there! ZFS rocks. NFS/ZFS is a bad combination.

Happy Thanksgiving (to those stateside).

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
On Nov 22, 2006, at 4:11 PM, Al Hopper wrote:
> No problem there! ZFS rocks. NFS/ZFS is a bad combination.

Has anyone tried sharing a ZFS fs using samba or afs or something else besides nfs? Do we have the same issues?

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net
On 11/22/06, Chad Leigh -- Shire.Net LLC <chad at shire.net> wrote:
>
> On Nov 22, 2006, at 4:11 PM, Al Hopper wrote:
>
>> No problem there! ZFS rocks. NFS/ZFS is a bad combination.
>
> Has anyone tried sharing a ZFS fs using samba or afs or something
> else besides nfs? Do we have the same issues?
>

I've done some CIFS tests in the past, and off the top of my head, it was about 3-5x faster than NFS.
Yes, I've tried NFS and CIFS. I wouldn't call this a problem, though. This is the way it was designed to work, to prevent loss of client data.

If you want faster performance, put a battery-backed RAID card in your system and turn on write-back caching on the card, so that the RAM in the RAID controller effectively acts as your NVRAM. I've tested this and obviously get much better performance for small files.

If you want to compare with other filesystems that don't guarantee your data, then as others have pointed out you can disable the ZIL and take your chances. Here you're no worse off than other OS/FS implementations that lie to you when they tell you that they've committed your data to persistent storage, keep it in RAM, and risk the failure modes associated with that.

If you avoid remote filesystems like NFS/CIFS and run locally, then this is obviously not an issue.

Cameron

--
On 11/22/06, Chad Leigh -- Shire.Net LLC <chad at shire.net> wrote:
> Has anyone tried sharing a ZFS fs using samba or afs or something
> else besides nfs? Do we have the same issues?
Have a gander below :> Agreed - it sucks - especially for small file use. Here''s a 5,000 ft view > of the performance while unzipping and extracting a tar archive. First > the test is run on a SPARC 280R running Build 51a with dual 900MHz USIII > CPUs and 4Gb of RAM: > > $ cp emacs-21.4a.tar.gz /tmp > $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf - > > real 13.092 > user 2.083 > sys 0.183here is my machine here ( Solaris 8 Ultra 2 200MHz ) # cd /tmp # ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz /export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes 74570.00k). /export/home/dclarke/star: Total time 11.057sec (6744 kBytes/sec) real 11.146 user 0.300 sys 1.762 and the same test on the same machine with a local UFS filesystem : # cd /mnt/test # ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz /export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes 74570.00k). /export/home/dclarke/star: Total time 92.378sec (807 kBytes/sec) real 1:32.463 user 0.351 sys 3.658 Pretty much what I expect for an old old Solaris 8 box. Then I try using a mounted NFS filesystem shared from ZFS on snv_46 # cat /etc/release Solaris Nevada snv_46 SPARC Copyright 2006 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 14 August 2006 # zfs set sharenfs=nosub,nosuid,rw=pluto,root=pluto zfs0/backup # zfs get sharenfs zfs0/backup NAME PROPERTY VALUE SOURCE zfs0/backup sharenfs nosub,nosuid,rw=pluto,root=pluto local # # tip hardwire connected pluto console login: root Password: Nov 22 18:41:50 pluto login: ROOT LOGIN /dev/console Last login: Tue Nov 21 02:07:39 on console Sun Microsystems Inc. SunOS 5.8 Generic Patch February 2004 # cat /etc/release Solaris 8 2/04 s28s_hw4wos_05a SPARC Copyright 2004 Sun Microsystems, Inc. All Rights Reserved. Assembled 08 January 2004 # dfshares mars RESOURCE SERVER ACCESS TRANSPORT mars:/export/zfs/backup mars - - mars:/export/zfs/qemu mars - - # # mkdir /export/nfs # mount -F nfs -o bg,intr,nosuid mars:/export/zfs/backup /export/nfs # # cd /export/nfs/titan # ls -lap total 142780 drwxr-xr-x 3 dclarke other 8 Nov 22 19:08 ./ drwxr-xr-x 9 root sys 12 Nov 15 20:14 ../ -rw-r--r-- 1 phil csw 13102 Jul 12 12:32 README.csw -rw-r--r-- 1 dclarke csw 189389 Sep 14 19:33 ae-2.2.0.tar.gz -rw-r--r-- 1 dclarke csw 91965440 Jul 25 12:56 dclarke.tar -rw-r--r-- 1 dclarke csw 20403483 Nov 22 19:07 emacs-21.4a.tar.gz -rw-r--r-- 1 dclarke csw 5468160 Jul 25 12:57 root.tar drwxr-xr-x 5 dclarke csw 5 May 24 2006 schily/ # Now that my Solaris 8 box has a mounted ZFS/NFS filesystem I test again # ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz /export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes 74570.00k). /export/home/dclarke/star: Total time 215.958sec (345 kBytes/sec) real 3:36.048 user 0.397 sys 5.961 # That is based on the ZFS/NFS mounted filesystem. What if I run the same test on my server locally? On ZFS ? # ptime /root/bin/star -x -time -z file=/tmp/emacs-21.4a.tar.gz /root/bin/star: 7457 blocks + 0 bytes (total of 76359680 bytes = 74570.00k). /root/bin/star: Total time 32.238sec (2313 kBytes/sec) real 32.680 user 6.973 sys 9.945 # So gee ... thats all pretty slow but really really slow with ZFS shared out via NFS. wow .. good to know. I *never* would have seen that coming. Dennis
Hi Al,

You conclude:

    No problem there! ZFS rocks. NFS/ZFS is a bad combination.

But my reading of your data leads to:

    Single-threaded small-file creation is much slower
    over NFS than locally, regardless of the server FS.

It's been posted on this alias before: change ZFS to anything else and it won't change the conclusion. NFS/AnyFS is a bad combination for single-threaded tar x.

-r
"Dennis Clarke" <dclarke at blastwave.org> wrote:> here is my machine here ( Solaris 8 Ultra 2 200MHz ) > > # cd /tmp > # ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz > /export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes > 74570.00k). > /export/home/dclarke/star: Total time 11.057sec (6744 kBytes/sec) > > real 11.146 > user 0.300 > sys 1.762 > > and the same test on the same machine with a local UFS filesystem : > > # cd /mnt/test > # ptime /export/home/dclarke/star -x -time -z file=/tmp/emacs-21.4a.tar.gz > /export/home/dclarke/star: 7457 blocks + 0 bytes (total of 76359680 bytes > 74570.00k). > /export/home/dclarke/star: Total time 92.378sec (807 kBytes/sec) > > real 1:32.463 > user 0.351 > sys 3.658 > > Pretty much what I expect for an old old Solaris 8 box.If you do this kind of tests, it makes sense, to repeat the test with star -no-fsync J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Hi, I haven''t followed all the details in this discussion, but it seems to me that it all breaks down to: - NFS on ZFS is slow due to NFS being very conservative when sending ACK to clients only after writes have definitely committed to disk. - Therefore, the problem is not that much ZFS specific, it''s just a conscious focus on data correctness vs. speed on ZFS/NFS'' part. - Currently known workarounds include: - Sacrifice correctness for speed by disabling ZIL or using a less conservative network file system. - Optimize NFS/ZFS to get as much speed as possible within the constraints of the NFS protocol. But one aspect I haven''t seen so far is: How can we optimize ZFS on a more hardware oriented level to both achieve good NFS speeds and still preserve the NFS level of correctness? One possibility might be to give the ZFS pool enough spindles so it can comfortably handle many small IOs fast enough for them not to become NFS commit bottlenecks. This may require some tweaking on the ZFS side so it doesn''t queue up write IOs for too long as to not delay commits more than necessary. Has anyone investigated this branch or am I too simplistic in my view of the underlying root of the problem? Best regards, Constantin -- Constantin Gonzalez Sun Microsystems GmbH, Germany Platform Technology Group, Client Solutions http://www.sun.de/ Tel.: +49 89/4 60 08-25 91 http://blogs.sun.com/constantin/
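As a sketch of the "enough spindles" idea, a pool made of several small mirrors gives ZFS more independent vdevs over which to spread small synchronous writes than a single wide raidz does; the device names below are placeholders, not a recommendation:

    # Hypothetical layout: four 2-way mirrors rather than one wide raidz
    zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
                      mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0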
Nope, wrong conclusion again. This large performance degradation has nothing whatsoever to do with ZFS. I have not seen data that would show a possible slowness on the part of ZFS vs. AnyFS on the backend; there may well be some, and that would be an entirely different discussion from the large slowdown that NFS induces compared to a local FS for single-threaded loads.

More inline.

Constantin Gonzalez writes:
> I haven't followed all the details in this discussion, but it seems to me
> that it all breaks down to:
>
> - NFS on ZFS is slow due to NFS being very conservative when sending
>   ACK to clients only after writes have definitely committed to disk.

Nope. NFS is slow for single-threaded tar extract. The conservative approach of NFS is needed with the NFS protocol in order to ensure client-side data integrity. Nothing ZFS related.

> - Therefore, the problem is not that much ZFS specific, it's just a
>   conscious focus on data correctness vs. speed on ZFS/NFS' part.

So, nope. Purely NFS related. Nothing ZFS related.

> - Currently known workarounds include:
>
>   - Sacrifice correctness for speed by disabling ZIL or using a less
>     conservative network file system.

Disabling the ZIL means the backing FS fails to commit properly. The NFS protocol is cheated, leading to client-side integrity issues. With UFS, which has similar slowness to ZFS, we can fix the slowness by enabling the WCE.

>   - Optimize NFS/ZFS to get as much speed as possible within the constraints
>     of the NFS protocol.

Not possible. Nothing related to ZFS here, and if NFS had ways to make this better I think it would have been done in v4. If we extended the protocol to allow for exclusive mounts (single-client access), then I would think that the extra knowledge could be used to gain speed... I don't know if this was considered by the NFS forum.

> But one aspect I haven't seen so far is: How can we optimize ZFS on a more
> hardware-oriented level to both achieve good NFS speeds and still preserve
> the NFS level of correctness?
>
> One possibility might be to give the ZFS pool enough spindles so it can
> comfortably handle many small IOs fast enough for them not to become
> NFS commit bottlenecks. This may require some tweaking on the ZFS side so
> it doesn't queue up write IOs for too long, so as not to delay commits more
> than necessary.

NFS is plenty fast in a throughput context (not that it does not need work). The complaints we have here are about single-threaded code.

> Has anyone investigated this branch or am I too simplistic in my view of the
> underlying root of the problem?

I'll let you be the judge of that ;-)

-r
Hi Roch,

thanks, now I better understand the issue :).

> Nope. NFS is slow for single-threaded tar extract. The
> conservative approach of NFS is needed with the NFS protocol
> in order to ensure client-side data integrity. Nothing ZFS
> related.
...
> NFS is plenty fast in a throughput context (not that it does
> not need work). The complaints we have here are about
> single-threaded code.

OK, then it's "just" a single-thread client request-latency issue, which (as increasingly often) software vendors need to realize. The proper way to deal with this, then, is to multi-thread on the application layer.

Reminds me of many UltraSPARC T1 issues, which don't sit in the hardware or the OS, but in the way applications have been developed for years :).

Best regards,
Constantin

--
Constantin Gonzalez                        Sun Microsystems GmbH, Germany
Platform Technology Group, Client Solutions           http://www.sun.de/
Tel.: +49 89/4 60 08-25 91             http://blogs.sun.com/constantin/
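A crude client-side illustration of the single-stream-latency versus aggregate-throughput point: run several extractions in parallel and compare the wall-clock time per archive against the single-threaded run. This is only a sketch, reusing the archive from the earlier tests; the target directories are arbitrary:

    # Four concurrent extractions into separate directories:
    for i in 1 2 3 4; do
        ( mkdir -p run$i && cd run$i && gunzip -c /tmp/emacs-21.4a.tar.gz | tar xf - ) &
    done
    wait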
On Thu, 23 Nov 2006, Roch - PAE wrote:
>
> Hi Al, You conclude:
>
>     No problem there! ZFS rocks. NFS/ZFS is a bad combination.
>
> But my reading of your data leads to:
>
>     Single-threaded small-file creation is much slower
>     over NFS than locally, regardless of the server FS.
>
> It's been posted on this alias before: change ZFS to anything else and
> it won't change the conclusion. NFS/AnyFS is a bad combination for
> single-threaded tar x.

Hi Roch -- you are correct in that the data presented was incomplete. I didn't present data for the same test with an NFS mount, from the same server, of a UFS-based filesystem. So here is that data point:

    $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -

    real       12.671
    user        2.356
    sys         0.228

This test is not totally fair, in that the UFS filesystem being shared is on a single 400Gb SATA drive being used as the boot device -- versus the 5-way raidz config which consists of 5 of those same 400Gb SATA drives. But the data clearly shows that NFS/ZFS is a bad combination: 2 minutes 33 seconds for NFS/ZFS versus 13 seconds (rounding up) for NFS/UFS.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Al Hopper writes:
> Hi Roch -- you are correct in that the data presented was incomplete. I
> didn't present data for the same test with an NFS mount, from the same
> server, of a UFS-based filesystem. So here is that data point:
>
>     $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -
>
>     real       12.671
>     user        2.356
>     sys         0.228
>
> This test is not totally fair, in that the UFS filesystem being shared is
> on a single 400Gb SATA drive being used as the boot device -- versus the
> 5-way raidz config which consists of 5 of those same 400Gb SATA drives.
> But the data clearly shows that NFS/ZFS is a bad combination: 2 minutes 33
> seconds for NFS/ZFS versus 13 seconds (rounding up) for NFS/UFS.

Thanks Al. I'd put $100 on the table that the WCE is enabled on the SATA drive backing UFS. Even if format says it's not, are there not some drives which just ignore the WC disable commands?

-r
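For what it's worth, one way to see what the drive claims about its write cache is format's expert mode; whether a given SATA drive actually honours the disable command is, as noted above, another matter:

    # format -e        (select the SATA disk, then navigate the menus)
    #   cache -> write_cache -> display    show the current WCE state
    #   cache -> write_cache -> disable    attempt to turn the write cache off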
Roch - PAE wrote:> > Not possible. Nothing related to ZFS here and if NFS had > ways to make this better i think it would have been done in v4. > > If we extended the protocol to allow for exclusive mounts > (single client access) then, I would think that the extra > knowledge could be used to gain speed... I don''t know if > this was considered by the NFS forum.This sounds very similar to delegations, quoting from RFC 3530 1.4.6. Client Caching and Delegation ... The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client is guaranteed certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file. However IIRC the current state of the NFSv4 client in OpenSolaris is that this functionality is not supported. My knowledge of what we have done in NFSv4 in OpenSolaris is a little rusty though so I highly recommend follow up with the NFS community. As for NFS performance it isn''t all bad it is just as you have pointed out that there is a pathological case that is being used in these tests. For some idea of what NFS can do see: http://blogs.sun.com/shepler/entry/spec_sfs_over_the_years -- Darren J Moffat
We have had file delegation on by default in NFSv4 since Solaris 10 FCS, putback in July 2004.

We're currently working on also providing directory delegations -- client caching of directory contents -- as part of the upcoming NFSv4.1.

cheers,
calum.

Darren J Moffat wrote:
> However IIRC the current state of the NFSv4 client in OpenSolaris is
> that this functionality is not supported. My knowledge of what we have
> done in NFSv4 in OpenSolaris is a little rusty though so I highly
> recommend follow up with the NFS community.
On Thu, Nov 23, 2006 at 03:37:33PM +0100, Roch - PAE wrote:
> Thanks Al. I'd put $100 on the table that the WCE is enabled on the
> SATA drive backing UFS. Even if format says it's not, are there not
> some drives which just ignore the WC disable commands?

I agree with Roch here. With UFS, if WCE is enabled on the drives (which I'm sure it is on Al's SATA drives), UFS is fooled into thinking that when it writes a block to disk, it's safe. The drive returns from the write amazingly fast (since the data only landed in cache, not on the media), so you get quick turnarounds (low latency) on NFS, which is the only thing that matters for single-threaded performance.

With ZFS, on the other hand, not only do we write data to the drive when NFS tells us to, but we issue a DKIOCFLUSHWRITECACHE ioctl to the underlying device (FLUSH_CACHE on ATA, SYNCHRONIZE_CACHE on SCSI) to ensure that the data that's supposed to be on the disk is really, truly on the disk. This typically takes around 4-6ms, which is quite a while. Again, this dictates the single-threaded NFS performance.

If you want an apples-to-apples comparison, either try the UFS/ZFS tests on a drive that has WCE disabled, or turn off the ZIL on a drive that has WCE enabled. I'll bet the difference will be rather slight, perhaps in favor of ZFS.

--Bill
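Concretely, the apples-to-apples runs Bill suggests could be set up with the commands already shown earlier in the thread (server side; the client then repeats the same ptime/tar extraction for each run):

    # Run 1: export from UFS with the drive's write cache disabled
    #        (format -e -> cache -> write_cache -> disable)
    #
    # Run 2: export from ZFS with the ZIL disabled
    echo 'zil_disable/W 1' | mdb -kw
    #
    # Client, for both runs:
    # ptime gunzip -c /tmp/emacs-21.4a.tar.gz | tar xf -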
Bill,

I did the same test on the Thumper I'm working on, with the NFS vols converted from ZFS stripes to SVM stripes -- in both cases the same number/type of disks in the stripe. In my very simple test (time for file in frame*; do cp /inq/$file /outq/$file; done), UFS did approximately 64 MB/s; the best run for ZFS was approx 58 MB/s. Not a huge difference for sure, but enough to make you think about switching. This was a single stream over a 10GE link (x4600 mounting vols from an x4500).

Matt

Bill Moore wrote:
> If you want an apples-to-apples comparison, either try the UFS/ZFS tests
> on a drive that has WCE disabled, or turn off the ZIL on a drive that
> has WCE enabled. I'll bet the difference will be rather slight, perhaps
> in favor of ZFS.
>
> --Bill
Calum Mackay wrote:
> We have had file delegation on by default in NFSv4 since Solaris 10 FCS,
> putback in July 2004.

The delegation of a file gives the client certain guarantees about how
that file may be accessed by other clients (regardless of NFS version)
or by processes local to the NFS server. If the client receives a read
delegation, it knows that no other client (or server process) may write
to the file. If the client receives a write delegation, then no other
client (or server process) may read or write the file.

Once a client has a delegation for a file, it may then deal locally -
i.e. with no need to contact the server - with reads (assuming it has
the required data cached), writes and locking. The file may even be
closed, locally, and the client will still retain the delegation, such
that other applications on this client may open the file, still without
contacting the server.

Note that delegation only provides a benefit when there is no
conflicting access. If conflicting accesses occur, the delegation is
recalled, the client flushes its data to the server, and the conflicting
accesses may proceed. In addition, note that a server is not *required*
to return a delegation to the client.

I would expect a write delegation to provide some performance increase
in the case noted, although I don't have any numbers in front of me.

cheers,
c.
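For anyone who wants to check whether delegations are actually being
handed out during these runs, something along the following lines should
show it. This is a sketch from memory (the /etc/default/nfs parameter
name and the exact nfsstat sections may differ on your build), not a
verified recipe.

   # On the x4500 (server): confirm delegations have not been disabled
   $ grep NFS_SERVER_DELEGATION /etc/default/nfs

   # On the server: look at the NFSv4 op counts (delegreturn, delegpurge)
   # before and after a run to see whether delegations were granted/recalled
   $ nfsstat -s -n

   # On the x4600 (client): confirm the mount actually negotiated vers=4,
   # then watch getattr/commit counts during the copy
   $ nfsstat -m /inq
   $ nfsstat -c -n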
This thread started with the notion that ZFS provided a very bad NFS
service, to the point of > 10X degradation over, say, UFS.

What I hope we have agreement on is that this scale of performance
difference does not come from ZFS, but from an NFS service that would
sacrifice integrity. Enabling the write cache (UFS) or disabling the
ZIL (ZFS) are two ways to get such a speedup.

Here, you have a fair comparison of UFS and ZFS serving NFS, and that
shows a 10% effect. It would be nice to analyze where that difference
comes from.

-r

Matt Sweeney writes:
 > Bill,
 >
 > I did the same test on the Thumper I'm working on with the NFS vols
 > converted from ZFS stripes to SVM stripes. In both cases the same
 > number/type of disks in the stripe. In my very simple test, time for
 > file in frame*; do cp /inq/$file /outq/$file; done, UFS did
 > approximately 64 MB/s; the best run for ZFS was approx 58 MB/s. Not a
 > huge difference for sure, but enough to make you think about switching.
 > This was a single stream over a 10GE link. (x4600 mounting vols from an x4500)
 >
 > Matt
 >
 > Bill Moore wrote:
 > > On Thu, Nov 23, 2006 at 03:37:33PM +0100, Roch - PAE wrote:
 > >
 > >> Al Hopper writes:
 > >>  > Hi Roch - you are correct in that the data presented was incomplete. I
 > >>  > didn't present data for the same test with an NFS mount from the same
 > >>  > server, for a UFS based filesystem. So here is that data point:
 > >>  >
 > >>  > $ ptime gunzip -c /tmp/emacs-21.4a.tar.gz |tar xf -
 > >>  >
 > >>  > real     12.671
 > >>  > user      2.356
 > >>  > sys       0.228
 > >>  >
 > >>  > This test is not totally fair, in that the UFS filesystem being shared is
 > >>  > on a single 400Gb SATA drive being used as the boot device - versus the
 > >>  > 5-way raidz config which consists of 5 of those same 400Gb SATA drives.
 > >>  > But the data clearly shows that NFS/ZFS is a bad combination: 2 minutes 33
 > >>  > seconds for NFS/ZFS versus 13 seconds (rounding up) for NFS/UFS.
 > >>
 > >> I'd put $100 on the table that the WCE is enabled on the
 > >> SATA drive backing UFS. Even if format says it's not, are
 > >> there not some drives which just ignore the WC disable
 > >> commands?
 > >>
 > >
 > > I agree with Roch here. With UFS, if WCE is enabled on the drives
 > > (which I'm sure it is on Al's SATA drives), UFS is fooled into thinking
 > > that when it writes a block to disk, it's safe. The drive returns from
 > > the write amazingly fast (since the data only landed in cache - not the
 > > media), so you get quick turnarounds (low latency) on NFS, which is the
 > > only thing that matters for single-threaded performance.
 > >
 > > With ZFS, on the other hand, not only do we write data to the drive when
 > > NFS tells us to, but we issue a DKIOCFLUSHWRITECACHE ioctl to the
 > > underlying device (FLUSH_CACHE on ATA, SYNCHRONIZE_CACHE on SCSI) to
 > > ensure that the data that's supposed to be on the disk is really, truly
 > > on the disk. This typically takes around 4-6 ms, which is quite a while.
 > > Again, this dictates the single-threaded NFS performance.
 > >
 > > If you want an apples-to-apples comparison, either try the UFS/ZFS tests
 > > on a drive that has WCE disabled, or turn off the ZIL on a drive that
 > > has WCE enabled. I'll bet the difference will be rather slight, perhaps
 > > in favor of ZFS.
 > >
 > >
 > > --Bill
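Following up on the suggestion to analyze where the remaining ~10%
comes from: one rough way is to watch the server's synchronous log
commit activity while the NFS copy runs. The fbt probe names below
assume zil_commit is visible to the fbt provider on that build (it
usually is, unless the function was inlined), so treat this as a sketch
rather than a recipe; run both as root on the x4500.

   # How many ZIL commits per second the server is doing during the copy
   dtrace -n 'fbt::zil_commit:entry { @c = count(); }
              tick-1s { printa("zil_commit/s: %@d\n", @c); trunc(@c); }'

   # Rough latency distribution of each commit (includes the cache flush)
   dtrace -n 'fbt::zil_commit:entry  { self->ts = timestamp; }
              fbt::zil_commit:return /self->ts/ {
                      @["ns per zil_commit"] = quantize(timestamp - self->ts);
                      self->ts = 0;
              }'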
Calum Mackay
2006-Nov-24 12:24 UTC
[nfs-discuss] Re: [zfs-discuss] poor NFS/ZFS performance
I should perhaps note that my last email on delegation describes the
optimisations possible under the NFSv4 protocol, as per the RFC; not all
of them are necessarily implemented in our own Solaris client.

In particular, I think that fsync and committed writes do still go
through to the server, so we may not see those performance enhancements.
We should at least do away with the regular GETATTRs that would
otherwise make up a lot of the conflicting-access detection in NFSv3.

cheers,
c.
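One way to see that effect on the wire, without a full snoop trace, is
to compare per-version call counts for the same copy. Everything below
is a sketch: thumper:/inq stands in for the real export, frame0001 for
one of the test files, and it assumes the export is reachable over both
NFS versions; run it as root on the x4600.

   $ mkdir -p /mnt/v3 /mnt/v4
   $ mount -o vers=3 thumper:/inq /mnt/v3
   $ mount -o vers=4 thumper:/inq /mnt/v4

   $ nfsstat -z                  # zero the client counters
   $ cp /tmp/frame0001 /mnt/v3/  # close() flushes and commits over NFS
   $ nfsstat -c -n               # note the v3 getattr and commit counts

   $ nfsstat -z
   $ cp /tmp/frame0001 /mnt/v4/
   $ nfsstat -c -n               # compare v4 getattr/commit and delegation ops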