Timothy Coalson
2012-Jun-13 22:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
I noticed recently that the SSDs hosting the ZIL for my pool had a large
number in the SMART attribute for total LBAs written (with some calculation,
it seems to be the total amount of data written to the pool so far), did some
testing, and found that the ZIL is being used quite heavily (matching the
write speed) on writes that should be asynchronous. I did a capture with
wireshark during a simple copy of a large file (8GB), and both the write
packets and the write responses showed the "UNSTABLE" I would expect in
asynchronous writes. The copy didn't finish before the server started
writing heavily to the ZIL, so I wouldn't expect it to tell NFS to commit
then.

Here are some relevant pieces of info (with hostnames, etc. removed):

client: ubuntu 11.10
/etc/fstab entry:
<server>:/mainpool/storage  /mnt/myelin  nfs  bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0  0

server: OpenIndiana oi_151a4

$ zfs get sync mainpool
NAME      PROPERTY  VALUE     SOURCE
mainpool  sync      standard  default

$ zfs get sync mainpool/storage
NAME              PROPERTY  VALUE     SOURCE
mainpool/storage  sync      standard  default

$ zfs get sharenfs mainpool/storage
NAME              PROPERTY  VALUE           SOURCE
mainpool/storage  sharenfs  rw=@xxx.xxx.37  local

$ zpool get version mainpool
NAME      PROPERTY  VALUE  SOURCE
mainpool  version   28     default

The pool consists of 24 SATA disks arranged as 2 raidz2 groups of 12, and
originally had a mirrored log across 10GB slices on two Intel 320 80GB SSDs,
but I have since rearranged the logs as non-mirrored (since a single SSD
doesn't quite keep up with gigabit network throughput, and some testing
convinced me that a pool survives a failing log device gracefully, as long
as there isn't a simultaneous crash). I would like to switch them back to a
mirrored configuration, but without impacting the asynchronous throughput,
and obviously it would be nice to reduce the write load on them so they
live longer.

As far as I can tell, all NFS writes are being written to the ZIL when many
should be cached in memory and bypass the ZIL.

Any help appreciated,
Tim
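For reference, a rough sketch of how the log could later be put back into a
mirror, assuming the two slice names that appear later in the thread and
standard zpool(1M) syntax; treat it as an untested outline rather than a
recipe:

    # drop the two standalone log devices (pool v28 supports log device removal)
    zpool remove mainpool c31t3d0s0
    zpool remove mainpool c31t4d0s0

    # re-add them as a single mirrored log vdev
    zpool add mainpool log mirror c31t3d0s0 c31t4d0s0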
Daniel Carosone
2012-Jun-13 23:51 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> client: ubuntu 11.10
> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
> 0

nfsvers=3

> NAME              PROPERTY  VALUE     SOURCE
> mainpool/storage  sync      standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are for
persistent writes at the server regardless - and mostly also for NFS 4.

The async client mount option relates to when the writes get shipped to the
server (immediately or delayed in dirty pages), rather than to how the
server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or even
just as a temporary test to confirm the impact. It sounds like you would be,
since that's what you're trying to achieve.

There is a difference: async on the client means data is lost on a client
reboot, async on the server means data may be lost on a server reboot (and
the client/application confused by inconsistencies as a result).

Separate datasets (and mounts) for data with different persistence
requirements can help.

--
Dan.
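As a concrete sketch of that temporary test (dataset name taken from the
thread, revert step included), it might look like:

    # server side: temporarily disable synchronous semantics on the export
    zfs set sync=disabled mainpool/storage

    # ...rerun the copy from the client and watch the log devices...

    # put the default behaviour back when done
    zfs set sync=standard mainpool/storage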
Timothy Coalson
2012-Jun-14 04:14 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Interesting... what I had read about NFSv3 asynchronous writes, especially
bits about "does not require the server to commit to stable storage", led me
to expect different behavior.

The performance impact on large writes (which we do a lot of) wasn't severe,
so sync=disabled is probably not worth the risk. The SSDs should also be
able to take a fair amount of abuse, so I can live with the behavior as-is.

Here's one example of documentation that led me to expect something else:
http://nfs.sourceforge.net/ , search for "unstable". It did indeed seem like
the nfs client was delaying writes, then doing synchronous nfs calls, which
is why I looked into the packets with wireshark (and found them to be
advertised as "UNSTABLE", but with the pool acting synchronous).

Thanks,
Tim

On Wed, Jun 13, 2012 at 6:51 PM, Daniel Carosone <dan at geek.com.au> wrote:
>
> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> > client: ubuntu 11.10
> > /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
> > bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
> > 0
>
> nfsvers=3
>
> > NAME              PROPERTY  VALUE     SOURCE
> > mainpool/storage  sync      standard  default
>
> sync=standard
>
> This is expected behaviour for this combination. NFS 3 semantics are
> for persistent writes at the server regardless - and mostly also
> for NFS 4.
>
> The async client mount option relates to when the writes get shipped
> to the server (immediately or delayed in dirty pages), rather than to
> how the server should handle those writes once they arrive.
>
> You could set sync=disabled if you're happy with the consequences, or
> even just as a temporary test to confirm the impact. It sounds like
> you would be, since that's what you're trying to achieve.
>
> There is a difference: async on the client means data is lost on a
> client reboot, async on the server means data may be lost on a server
> reboot (and the client/application confused by inconsistencies as a
> result).
>
> Separate datasets (and mounts) for data with different persistence
> requirements can help.
>
> --
> Dan.
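For anyone wanting to repeat that packet check, a minimal capture sketch;
the interface name and server address are placeholders, not values from the
thread:

    # capture the NFS traffic (port 2049) to a file while the copy runs
    sudo tcpdump -i eth0 -s 0 -w nfs-copy.pcap host <server> and port 2049

    # then open nfs-copy.pcap in wireshark and filter on "nfs" to inspect
    # the stable/UNSTABLE flag in the WRITE calls and replies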
Richard Elling
2012-Jun-14 14:24 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:

> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
>> client: ubuntu 11.10
>> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
>> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
>> 0
>
> nfsvers=3
>
>> NAME              PROPERTY  VALUE     SOURCE
>> mainpool/storage  sync      standard  default
>
> sync=standard
>
> This is expected behaviour for this combination. NFS 3 semantics are
> for persistent writes at the server regardless - and mostly also
> for NFS 4.

NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
async and sync activity, try nfssvrtop:
	https://github.com/richardelling/tools

 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
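A quick sketch of fetching and running the script; the location of
nfssvrtop inside the checkout is an assumption, so adjust the path to
wherever it actually lives in the repository:

    git clone https://github.com/richardelling/tools
    cd tools
    chmod +x nfssvrtop
    sudo ./nfssvrtop 5     # 5-second sample interval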
Timothy Coalson
2012-Jun-14 19:20 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Thanks for the script. Here is some sample output from 'sudo ./nfssvrtop -b
512 5' (my disks are 512B-sector emulated and the pool is ashift=9; some
benchmarking didn't show much difference with ashift=12 other than giving up
8% of available space) during a copy operation from 37.30 with
sync=standard:

2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits  Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t    Com_t  Align%
3    xxx.xxx.37.30     108      0        0      108        0      0       0  111206     0      0    396  1917419     100

a bit later...

3    xxx.xxx.37.30     109      0        0      108        0      0       0  111411     0      0    427        0     100

sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:

logs             -      -      -      -      -      -
  c31t3d0s0   260M  9.68G      0  1.21K      0  85.3M
  c31t4d0s0   260M  9.68G      0  1.21K      0  85.1M

In case the alignment fails, the nonzero entries are under NFSOPS, AWrites,
AWr_bw, AWr_t, Com_t and Align%. The Com_t (average commit time?) column
alternates between zero and a million or two (the other columns stay about
the same, the zeros stay zero), while the "Commits" column stays zero during
the copy. The write throughput to the logs varies quite a bit; that sample
is a very high mark, it mainly alternates between almost zero and 30M each,
which is kind of odd considering the copy speed (using gigabit network, copy
speed averages around 110MB/s).

When I 'zfs set sync=disabled', the output of nfssvrtop stays about the
same, except that Com_t stays 0, and the log devices also stay at 0
throughput. Could you enlighten me as to what "Com_t" measures when
"Commits" stays zero? Perhaps the nfs server caches asynchronous nfs writes
as I expect, but flushes its cache with synchronous writes?

Thanks,
Tim

On Thu, Jun 14, 2012 at 9:24 AM, Richard Elling <richard.elling at gmail.com> wrote:
> On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:
>
>> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
>>> client: ubuntu 11.10
>>> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
>>> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
>>> 0
>>
>> nfsvers=3
>>
>>> NAME              PROPERTY  VALUE     SOURCE
>>> mainpool/storage  sync      standard  default
>>
>> sync=standard
>>
>> This is expected behaviour for this combination. NFS 3 semantics are
>> for persistent writes at the server regardless - and mostly also
>> for NFS 4.
>
> NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
> async and sync activity, try nfssvrtop
>        https://github.com/richardelling/tools
>
>  -- richard
>
> --
> ZFS and performance consulting
> http://www.RichardElling.com
Richard Elling
2012-Jun-14 20:18 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Hi Tim,

On Jun 14, 2012, at 12:20 PM, Timothy Coalson wrote:

> Thanks for the script. Here is some sample output from 'sudo
> ./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool
> is ashift=9; some benchmarking didn't show much difference with
> ashift=12 other than giving up 8% of available space) during a copy
> operation from 37.30 with sync=standard:
>
> 2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
> Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits  Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t    Com_t  Align%
> 3    xxx.xxx.37.30     108      0        0      108        0      0       0  111206     0      0    396  1917419     100
> a bit later...
> 3    xxx.xxx.37.30     109      0        0      108        0      0       0  111411     0      0    427        0     100
>
> sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:
> logs             -      -      -      -      -      -
>   c31t3d0s0   260M  9.68G      0  1.21K      0  85.3M
>   c31t4d0s0   260M  9.68G      0  1.21K      0  85.1M
>
> In case the alignment fails, the nonzero entries are under NFSOPS,
> AWrites, AWr_bw, AWr_t, Com_t and Align%. The Com_t (average commit
> time?) column alternates between zero and a million or two (the other
> columns stay about the same, the zeros stay zero), while the "Commits"
> column stays zero during the copy. The write throughput to the logs
> varies quite a bit; that sample is a very high mark, it mainly
> alternates between almost zero and 30M each, which is kind of odd
> considering the copy speed (using gigabit network, copy speed averages
> around 110MB/s).

The client is using async writes, that include commits. Sync writes do not
need commits.

What happens is that the ZFS transaction group commit occurs at more-or-less
regular intervals, likely 5 seconds for more modern ZFS systems. When the
commit occurs, any data that is in the ARC but not committed in a prior
transaction group gets sent to the ZIL. This is why you might see a very
different amount of ZIL activity relative to the expected write workload.

> When I 'zfs set sync=disabled', the output of nfssvrtop stays about
> the same, except that Com_t stays 0, and the log devices also stay at 0
> throughput. Could you enlighten me as to what "Com_t" measures
> when "Commits" stays zero? Perhaps the nfs server caches asynchronous
> nfs writes as I expect, but flushes its cache with synchronous
> writes?

With sync=disabled, the ZIL is not used, thus the commit response to the
client is a lie, breaking the covenant between the server and client. In
other words, the server is supposed to respond to the commit only when the
data is written to permanent media, but the administrator overruled this
action by disabling the ZIL. If the server were to unexpectedly restart, or
other conditions occur such that the write cannot be completed, then the
server and client will have different views of the data, a form of data
loss.

Different applications can react to long commit times differently. In this
example, we see 1.9 seconds for the commit versus about 400 microseconds for
each async write. The cause of the latency of the commit is not apparent
from any bandwidth measurements (e.g. zpool iostat) and you should consider
looking more closely at the "iostat -x" latency to see if the log devices
are performing well.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
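As a small sketch of that latency check, narrowed to the two log devices
named earlier in the thread (the -n flag prints c#t#d# names instead of
sd## instance numbers, and asvc_t is the average service time in
milliseconds; the grep pattern is illustrative):

    iostat -xn 5 | egrep 'device|c31t3d0|c31t4d0'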
Robert Milkowski
2012-Jun-14 20:35 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
> The client is using async writes, that include commits. Sync writes do
> not need commits.
>
> What happens is that the ZFS transaction group commit occurs at more-
> or-less regular intervals, likely 5 seconds for more modern ZFS
> systems. When the commit occurs, any data that is in the ARC but not
> committed in a prior transaction group gets sent to the ZIL

Are you sure? I don't think this is the case, unless I misunderstood you or
this is some recent change to Illumos.

Whatever is being committed when a zfs txg closes goes directly to the pool
and not to the zil. Only sync writes will go to the zil right away (and not
always, see logbias, etc.) and to the arc to be committed later to the pool
when the txg closes.

--
Robert Milkowski
http://milek.blogspot.com
Timothy Coalson
2012-Jun-14 22:15 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
> The client is using async writes, that include commits. Sync writes do not
> need commits.

Are you saying nfs commit operations sent by the client aren't always
reported by that script?

> What happens is that the ZFS transaction group commit occurs at more-or-less
> regular intervals, likely 5 seconds for more modern ZFS systems. When the
> commit occurs, any data that is in the ARC but not committed in a prior
> transaction group gets sent to the ZIL. This is why you might see a very
> different amount of ZIL activity relative to the expected write workload.

I thought this was the other way around: synchronous writes immediately get
copied to ZIL blocks (logbias may cause them to be allocated from the data
disks despite the existence of a separate log device) in order to not
require that a txg go through immediately to service them, then a txg that
includes the same data eventually goes directly to the data disks, and then
they get removed from the ZIL.

> With sync=disabled, the ZIL is not used, thus the commit response to the
> client is a lie, breaking the covenant between the server and client.

I was not clear: the behavior with sync=disabled was exactly what I expected
it to be; my question was why the Com_t attribute was nonzero with
sync=standard, while Commits stayed zero.

To be clear, while it was in the middle of copying data, the "Commits"
column always read zero, never even one commit reported (in fact, Commits
was always zero, even when a copy completed). I would expect this to mean
the nfs client was issuing asynchronous writes without a single commit
request (eventually filling the server's cache and requiring the server to
do something to resolve it). I am copying a single large file, 8GB, and
usually interrupting it before it finishes, after I have some data that
matches the period where it is writing to the log devices (it takes a few
seconds to get through the client and server caches and start using the
disks).

> Different applications can react to long commit times differently. In this
> example, we see 1.9 seconds for the commit versus about 400 microseconds
> for each async write. The cause of the latency of the commit is not
> apparent from any bandwidth measurements (e.g. zpool iostat) and you
> should consider looking more closely at the "iostat -x" latency to see if
> the log devices are performing well.

I am still unclear on the commit time being reported for zero reported
commit requests. Here is what I think are the log devices (I am unsure how
to translate the sd## mapping to solaris style controller/target devices)
under iostat -x 5 when they are being written to heavily:

device    r/s     w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
sd30      0.0  1073.1    0.0  72940.6   0.0   7.6    7.1   1  77
sd31      0.0  1075.3    0.0  73090.2   0.0   7.6    7.1   1  77

Considering their maximum write speed is reported as 90MB/s, and they each
wrote 70MB/s in that interval, I do not expect the svc_t to reflect the seek
time they can achieve. There were no particularly suspicious spikes in svc_t
on any device while Com_t was large.

I have a guess that your script is reporting the time taken to flush a cache
that got filled with async writes, and needed to be emptied to be able to
continue to service async writes (a lot of data, therefore a lot of time).
I'm wondering if this belongs in nfs-discuss; if the nfs server issues
synchronous writes to empty its cache (without receiving an nfs commit
request), it seems to explain everything.

Tim
Phil Harman
2012-Jun-15 03:30 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:

>> The client is using async writes, that include commits. Sync writes do not
>> need commits.
>
> Are you saying nfs commit operations sent by the client aren't always
> reported by that script?

They are not reported in your case because the commit rate is less than one
per second.

DTrace is an amazing tool, but it does dictate certain coding compromises,
particularly when it comes to output scaling, grouping, sorting and
formatting.

In this script the commit rate is calculated using integer division. In your
case the sample interval is 5 seconds, so up to 4 commits in the interval
will be reported as a big fat zero.

If you use a sample interval of 1 second you should see occasional commits.
We know they are there because we see a non-zero commit time.
Timothy Coalson
2012-Jun-15 04:20 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Indeed they are there, shown with a 1 second interval. So, it is the
client's fault after all. I'll have to see whether it is somehow possible to
get the server to write cached data sooner (and hopefully asynchronously),
and the client to issue commits less often. Luckily I can live with the
current behavior (and the SSDs shouldn't give out any time soon even being
used like this), if it isn't possible to change it.

Thanks for all the help,
Tim

On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman <phil.harman at gmail.com> wrote:
> On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:
>
>>> The client is using async writes, that include commits. Sync writes do not
>>> need commits.
>>
>> Are you saying nfs commit operations sent by the client aren't always
>> reported by that script?
>
> They are not reported in your case because the commit rate is less than one
> per second.
>
> DTrace is an amazing tool, but it does dictate certain coding compromises,
> particularly when it comes to output scaling, grouping, sorting and
> formatting.
>
> In this script the commit rate is calculated using integer division. In
> your case the sample interval is 5 seconds, so up to 4 commits in the
> interval will be reported as a big fat zero.
>
> If you use a sample interval of 1 second you should see occasional
> commits. We know they are there because we see a non-zero commit time.
Richard Elling
2012-Jun-15 15:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
[Phil beat me to it]
Yes, the 0s are a result of integer division in DTrace/kernel.

On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:

> Indeed they are there, shown with a 1 second interval. So, it is the
> client's fault after all. I'll have to see whether it is somehow
> possible to get the server to write cached data sooner (and hopefully
> asynchronously), and the client to issue commits less often. Luckily I
> can live with the current behavior (and the SSDs shouldn't give out
> any time soon even being used like this), if it isn't possible to
> change it.

If this is the proposed workload, then it is possible to tune the DMU to
manage commits more efficiently. In an ideal world, it does this
automatically, but the algorithms are based on a bandwidth calculation and
those are not suitable for HDD capacity planning. The efficiency goal would
be to do less work, more often, and there are two tunables that can apply:

1. the txg_timeout controls the default maximum transaction group commit
interval and is set to 5 seconds on modern ZFS implementations.

2. the zfs_write_limit is a size limit for txg commit. The idea is that a
txg will be committed when the size reaches this limit, rather than waiting
for the txg_timeout. For streaming writes, this can work better than tuning
the txg_timeout.

 -- richard

> Thanks for all the help,
> Tim
>
> On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman <phil.harman at gmail.com> wrote:
>> On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:
>>
>>>> The client is using async writes, that include commits. Sync writes do not
>>>> need commits.
>>>
>>> Are you saying nfs commit operations sent by the client aren't always
>>> reported by that script?
>>
>> They are not reported in your case because the commit rate is less than
>> one per second.
>>
>> DTrace is an amazing tool, but it does dictate certain coding compromises,
>> particularly when it comes to output scaling, grouping, sorting and
>> formatting.
>>
>> In this script the commit rate is calculated using integer division. In
>> your case the sample interval is 5 seconds, so up to 4 commits in the
>> interval will be reported as a big fat zero.
>>
>> If you use a sample interval of 1 second you should see occasional
>> commits. We know they are there because we see a non-zero commit time.

--
ZFS and performance consulting
http://www.RichardElling.com
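A hedged sketch of applying those two tunables on an illumos-era server; the
variable names (zfs_txg_timeout, zfs_write_limit_override) and the mdb write
syntax are assumptions about that vintage of ZFS, and the values are purely
illustrative:

    # shorten the txg commit interval on the running kernel (seconds)
    echo "zfs_txg_timeout/W 0t2" | mdb -kw

    # cap the amount of data per txg (bytes); here roughly 256MB
    echo "zfs_write_limit_override/Z 0t268435456" | mdb -kw

    # rough /etc/system equivalents, to persist across reboots:
    #   set zfs:zfs_txg_timeout = 2
    #   set zfs:zfs_write_limit_override = 268435456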
Richard Elling
2012-Jun-15 15:57 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 14, 2012, at 1:35 PM, Robert Milkowski wrote:

>> The client is using async writes, that include commits. Sync writes do
>> not need commits.
>>
>> What happens is that the ZFS transaction group commit occurs at more-
>> or-less regular intervals, likely 5 seconds for more modern ZFS
>> systems. When the commit occurs, any data that is in the ARC but not
>> committed in a prior transaction group gets sent to the ZIL
>
> Are you sure? I don't think this is the case, unless I misunderstood you
> or this is some recent change to Illumos.

Need to make sure we are clear here: there is time between the txg being
closed and the txg being on disk. During that period, a sync write of the
data in the closed txg is written to the ZIL.

> Whatever is being committed when a zfs txg closes goes directly to the
> pool and not to the zil. Only sync writes will go to the zil right away
> (and not always, see logbias, etc.) and to the arc to be committed later
> to the pool when the txg closes.

In this specific case, there are separate log devices, so logbias doesn't
apply.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
Timothy Coalson
2012-Jun-15 17:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Thanks for the suggestions. I think it would also depend on whether the nfs
server has tried to write asynchronously to the pool in the meantime, which
I am unsure how to test, other than making the txgs extremely frequent and
watching the load on the log devices.

As for the integer division giving misleading zeros, one possible solution
is to add (delay-1) to the count before dividing by delay, so if there are
any, it will show at least 1 (or you could get fancy and do fixed-point
numbers).

As for very frequent txgs, I imagine this could cause more fragmentation
(more metadata written and discarded more frequently); is there a way to
estimate or test for the impact of it? Depending on how it allocates the
metadata blocks, I suppose it could write them to the blocks recently
vacated by old metadata from the previous txg, and have almost no impact
until a snapshot is taken. Is it smart enough to do this?

Tim

On Fri, Jun 15, 2012 at 10:56 AM, Richard Elling <richard.elling at gmail.com> wrote:
> [Phil beat me to it]
> Yes, the 0s are a result of integer division in DTrace/kernel.
>
> On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:
>
>> Indeed they are there, shown with a 1 second interval. So, it is the
>> client's fault after all. I'll have to see whether it is somehow
>> possible to get the server to write cached data sooner (and hopefully
>> asynchronously), and the client to issue commits less often. Luckily I
>> can live with the current behavior (and the SSDs shouldn't give out
>> any time soon even being used like this), if it isn't possible to
>> change it.
>
> If this is the proposed workload, then it is possible to tune the DMU to
> manage commits more efficiently. In an ideal world, it does this
> automatically, but the algorithms are based on a bandwidth calculation and
> those are not suitable for HDD capacity planning. The efficiency goal
> would be to do less work, more often, and there are two tunables that can
> apply:
>
> 1. the txg_timeout controls the default maximum transaction group commit
> interval and is set to 5 seconds on modern ZFS implementations.
>
> 2. the zfs_write_limit is a size limit for txg commit. The idea is that a
> txg will be committed when the size reaches this limit, rather than
> waiting for the txg_timeout. For streaming writes, this can work better
> than tuning the txg_timeout.
>
>  -- richard
>
> --
> ZFS and performance consulting
> http://www.RichardElling.com
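A tiny sketch to make the rounding idea concrete (plain shell; the variable
names are made up rather than taken from the nfssvrtop source):

    #!/bin/sh
    # 4 commits observed over a 5-second sample interval
    commits=4
    delay=5

    # plain integer division hides the activity entirely
    echo "commits/delay           = $(( commits / delay ))"                 # prints 0

    # adding (delay-1) first rounds up, so any activity shows as at least 1
    echo "(commits+delay-1)/delay = $(( (commits + delay - 1) / delay ))"   # prints 1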
Timothy Coalson
2012-Jun-15 22:04 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Fri, Jun 15, 2012 at 12:56 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:
> Thanks for the suggestions. I think it would also depend on whether
> the nfs server has tried to write asynchronously to the pool in the
> meantime, which I am unsure how to test, other than making the txgs
> extremely frequent and watching the load on the log devices.

I didn't want to reboot the main file server to test this, so I used zilstat
on the backup nfs server (which has nearly identical hardware and
configuration, but doesn't have SSDs for a separate ZIL) to see if I could
estimate the difference it would make, and the story got stranger: it wrote
far less data to the ZIL for the same copy operation (single 8GB file):

$ sudo ./zilstat -M -l 20 -p backuppool txg
waiting for txg commit...
      txg   N-MB  N-MB/s  N-Max-Rate   B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2833307      1       0           1      1       0           1     15      0       0      15
  2833308      0       0           0      0       0           0      0      0       0       0
  2833309      1       0           1      1       0           1      8      0       0       8
  2833310      0       0           0      0       0           0      4      0       0       4
  2833311      1       0           0      1       0           0      9      0       0       9
  2833312      0       0           0      0       0           0      0      0       0       0
  2833313      2       0           2      2       0           2     21      0       0      21
  2833314      7       1           7      8       1           8     63      0       0      63
  2833315      1       0           1      2       0           2     18      0       0      18
  2833316      0       0           0      0       0           0      5      0       0       5

A small sample from the server with SSD log devices doing the same
operation:

$ sudo ./zilstat -M -l 20 -p mainpool txg
waiting for txg commit...
      txg   N-MB  N-MB/s  N-Max-Rate   B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2808483    989     197         593   1967     393        1180  15010      0       0   15010
  2808484    599      99         208   1134     189         393   8653      0       0    8653
  2808485      0       0           0      0       0           0      0      0       0       0
  2808486    137      27         126    255      51         235   1953      0       0    1953
  2808487    460      92         460    859     171         859   6555      0       0    6555
  2808488    530      75         530   1031     147        1031   7871      0       0    7871

Setting logbias=throughput makes the server with the SSD log devices act the
same as the server without them, as far as I can tell, which I somewhat
expected. However, I did not expect use of separate log devices to change
how often ZIL ops are performed, other than to raise the upper limit if the
device can service more IOPS. Additionally, nfssvrtop showed a lower value
for Com_t when not using the separate log device (2.1s with logbias=latency,
0.24s with throughput).

Copying a folder with small files and subdirectories pushes the server to
~400 ZIL ops per txg with logbias=throughput, so it shouldn't be the device
performance making it issue only ~15 ops per txg when copying a large file
without using a separate log device. I am thinking of transplanting one of
the SSDs temporarily for testing, but I would be interested to know the
cause of this behavior. I don't know why more asynchronous writes seem to be
making it into txgs without being caught by an nfs commit when a separate
log device isn't used.

Tim
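For completeness, a small sketch of the logbias comparison described above
(dataset name from earlier in the thread; latency and throughput are the two
standard property values, and the revert step is included):

    # default: ZIL writes go to the separate log devices
    zfs set logbias=latency mainpool/storage
    # ...run the copy, watch zilstat / nfssvrtop...

    # bypass the slog and allocate ZIL blocks from the pool disks instead
    zfs set logbias=throughput mainpool/storage
    # ...run the copy again, then revert to the inherited setting
    zfs inherit logbias mainpool/storage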