Timothy Coalson
2012-Jun-13 22:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
I noticed recently that the SSDs hosting the ZIL for my pool had a large
number in the SMART attribute for total LBAs written (with some calculation,
it seems to be the total amount of data written to the pool so far), did some
testing, and found that the ZIL is being used quite heavily (matching the
write speed) on writes that should be asynchronous. I did a capture with
wireshark during a simple copy of a large file (8GB), and both the write
packets and the write responses showed the "UNSTABLE" I would expect in
asynchronous writes. The copy didn't finish before the server started
writing heavily to the ZIL, so I wouldn't expect it to tell NFS to commit
then.

Here are some relevant pieces of info (with hostnames, etc. removed):

client: ubuntu 11.10
/etc/fstab entry:
<server>:/mainpool/storage  /mnt/myelin  nfs  bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0  0

server: OpenIndiana oi_151a4

$ zfs get sync mainpool
NAME      PROPERTY  VALUE     SOURCE
mainpool  sync      standard  default

$ zfs get sync mainpool/storage
NAME              PROPERTY  VALUE     SOURCE
mainpool/storage  sync      standard  default

$ zfs get sharenfs mainpool/storage
NAME              PROPERTY  VALUE           SOURCE
mainpool/storage  sharenfs  rw=@xxx.xxx.37  local

$ zpool get version mainpool
NAME      PROPERTY  VALUE  SOURCE
mainpool  version   28     default

The pool consists of 24 SATA disks arranged as 2 raidz2 groups of 12, and
originally had a mirrored log across 10GB slices on two Intel 320 80GB SSDs,
but I have since rearranged the logs as non-mirrored (since a single SSD
doesn't quite keep up with gigabit network throughput, and some testing
convinced me that a pool survives a failing log device gracefully, as long
as there isn't a simultaneous crash). I would like to switch them back to a
mirrored configuration, but without impacting the asynchronous throughput,
and obviously it would be nice to reduce the write load on them so they
live longer.

As far as I can tell, all NFS writes are being written to the ZIL when many
should be cached in memory and bypass the ZIL.

Any help appreciated,
Tim
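For reference, a rough sketch of how the log could later be put back into a
mirror, assuming the two slice names that appear later in the thread and
standard zpool(1M) syntax; treat it as an untested outline rather than a
recipe:

    # drop the two standalone log devices (pool v28 supports log device removal)
    zpool remove mainpool c31t3d0s0
    zpool remove mainpool c31t4d0s0

    # re-add them as a single mirrored log vdev
    zpool add mainpool log mirror c31t3d0s0 c31t4d0s0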
Daniel Carosone
2012-Jun-13 23:51 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> client: ubuntu 11.10
> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
> 0

nfsvers=3

> NAME              PROPERTY  VALUE     SOURCE
> mainpool/storage  sync      standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are for
persistent writes at the server regardless - and mostly also for NFS 4.

The async client mount option relates to when the writes get shipped to the
server (immediately or delayed in dirty pages), rather than to how the
server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or even
just as a temporary test to confirm the impact. It sounds like you would be,
since that's what you're trying to achieve.

There is a difference: async on the client means data is lost on a client
reboot, async on the server means data may be lost on a server reboot (and
the client/application confused by inconsistencies as a result).

Separate datasets (and mounts) for data with different persistence
requirements can help.

--
Dan.
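As a concrete sketch of that temporary test (dataset name taken from the
thread, revert step included), it might look like:

    # server side: temporarily disable synchronous semantics on the export
    zfs set sync=disabled mainpool/storage

    # ...rerun the copy from the client and watch the log devices...

    # put the default behaviour back when done
    zfs set sync=standard mainpool/storage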
Timothy Coalson
2012-Jun-14 04:14 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Interesting... what I had read about NFSv3 asynchronous writes, especially
bits about "does not require the server to commit to stable storage", led me
to expect different behavior.

The performance impact on large writes (which we do a lot of) wasn't severe,
so sync=disabled is probably not worth the risk. The SSDs should also be
able to take a fair amount of abuse, so I can live with the behavior as-is.

Here's one example of documentation that led me to expect something else:
http://nfs.sourceforge.net/ , search for "unstable". It did indeed seem like
the nfs client was delaying writes, then doing synchronous nfs calls, which
is why I looked into the packets with wireshark (and found them to be
advertised as "UNSTABLE", but with the pool acting synchronous).

Thanks,
Tim

On Wed, Jun 13, 2012 at 6:51 PM, Daniel Carosone <dan at geek.com.au> wrote:
>
> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> > client: ubuntu 11.10
> > /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
> > bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
> > 0
>
> nfsvers=3
>
> > NAME              PROPERTY  VALUE     SOURCE
> > mainpool/storage  sync      standard  default
>
> sync=standard
>
> This is expected behaviour for this combination. NFS 3 semantics are
> for persistent writes at the server regardless - and mostly also
> for NFS 4.
>
> The async client mount option relates to when the writes get shipped
> to the server (immediately or delayed in dirty pages), rather than to
> how the server should handle those writes once they arrive.
>
> You could set sync=disabled if you're happy with the consequences, or
> even just as a temporary test to confirm the impact. It sounds like
> you would be, since that's what you're trying to achieve.
>
> There is a difference: async on the client means data is lost on a
> client reboot, async on the server means data may be lost on a server
> reboot (and the client/application confused by inconsistencies as a
> result).
>
> Separate datasets (and mounts) for data with different persistence
> requirements can help.
>
> --
> Dan.
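For anyone wanting to repeat that packet check, a minimal capture sketch;
the interface name and server address are placeholders, not values from the
thread:

    # capture the NFS traffic (port 2049) to a file while the copy runs
    sudo tcpdump -i eth0 -s 0 -w nfs-copy.pcap host <server> and port 2049

    # then open nfs-copy.pcap in wireshark and filter on "nfs" to inspect
    # the stable/UNSTABLE flag in the WRITE calls and replies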
Richard Elling
2012-Jun-14 14:24 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:

> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
>> client: ubuntu 11.10
>> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
>> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
>> 0
>
> nfsvers=3
>
>> NAME              PROPERTY  VALUE     SOURCE
>> mainpool/storage  sync      standard  default
>
> sync=standard
>
> This is expected behaviour for this combination. NFS 3 semantics are
> for persistent writes at the server regardless - and mostly also
> for NFS 4.

NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
async and sync activity, try nfssvrtop:
	https://github.com/richardelling/tools

 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
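A quick sketch of fetching and running the script; the location of
nfssvrtop inside the checkout is an assumption, so adjust the path to
wherever it actually lives in the repository:

    git clone https://github.com/richardelling/tools
    cd tools
    chmod +x nfssvrtop
    sudo ./nfssvrtop 5     # 5-second sample interval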
Timothy Coalson
2012-Jun-14 19:20 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Thanks for the script. Here is some sample output from 'sudo ./nfssvrtop -b
512 5' (my disks are 512B-sector emulated and the pool is ashift=9; some
benchmarking didn't show much difference with ashift=12 other than giving up
8% of available space) during a copy operation from 37.30 with
sync=standard:

2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits  Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t    Com_t  Align%
3    xxx.xxx.37.30     108      0        0      108        0      0       0  111206     0      0    396  1917419     100

a bit later...

3    xxx.xxx.37.30     109      0        0      108        0      0       0  111411     0      0    427        0     100

sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:

logs             -      -      -      -      -      -
  c31t3d0s0   260M  9.68G      0  1.21K      0  85.3M
  c31t4d0s0   260M  9.68G      0  1.21K      0  85.1M

In case the alignment fails, the nonzero entries are under NFSOPS, AWrites,
AWr_bw, AWr_t, Com_t and Align%. The Com_t (average commit time?) column
alternates between zero and a million or two (the other columns stay about
the same, the zeros stay zero), while the "Commits" column stays zero during
the copy. The write throughput to the logs varies quite a bit; that sample
is a very high mark, it mainly alternates between almost zero and 30M each,
which is kind of odd considering the copy speed (using gigabit network, copy
speed averages around 110MB/s).

When I 'zfs set sync=disabled', the output of nfssvrtop stays about the
same, except that Com_t stays 0, and the log devices also stay at 0
throughput. Could you enlighten me as to what "Com_t" measures when
"Commits" stays zero? Perhaps the nfs server caches asynchronous nfs writes
as I expect, but flushes its cache with synchronous writes?

Thanks,
Tim

On Thu, Jun 14, 2012 at 9:24 AM, Richard Elling <richard.elling at gmail.com> wrote:
> On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:
>
>> On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
>>> client: ubuntu 11.10
>>> /etc/fstab entry: <server>:/mainpool/storage  /mnt/myelin  nfs
>>> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0
>>> 0
>>
>> nfsvers=3
>>
>>> NAME              PROPERTY  VALUE     SOURCE
>>> mainpool/storage  sync      standard  default
>>
>> sync=standard
>>
>> This is expected behaviour for this combination. NFS 3 semantics are
>> for persistent writes at the server regardless - and mostly also
>> for NFS 4.
>
> NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
> async and sync activity, try nfssvrtop
>        https://github.com/richardelling/tools
>
>  -- richard
>
> --
> ZFS and performance consulting
> http://www.RichardElling.com
Richard Elling
2012-Jun-14 20:18 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Hi Tim,

On Jun 14, 2012, at 12:20 PM, Timothy Coalson wrote:

> Thanks for the script. Here is some sample output from 'sudo
> ./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool
> is ashift=9; some benchmarking didn't show much difference with
> ashift=12 other than giving up 8% of available space) during a copy
> operation from 37.30 with sync=standard:
>
> 2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
> Ver  Client         NFSOPS  Reads  SWrites  AWrites  Commits  Rd_bw  SWr_bw  AWr_bw  Rd_t  SWr_t  AWr_t    Com_t  Align%
> 3    xxx.xxx.37.30     108      0        0      108        0      0       0  111206     0      0    396  1917419     100
> a bit later...
> 3    xxx.xxx.37.30     109      0        0      108        0      0       0  111411     0      0    427        0     100
>
> sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:
> logs             -      -      -      -      -      -
>   c31t3d0s0   260M  9.68G      0  1.21K      0  85.3M
>   c31t4d0s0   260M  9.68G      0  1.21K      0  85.1M
>
> In case the alignment fails, the nonzero entries are under NFSOPS,
> AWrites, AWr_bw, AWr_t, Com_t and Align%. The Com_t (average commit
> time?) column alternates between zero and a million or two (the other
> columns stay about the same, the zeros stay zero), while the "Commits"
> column stays zero during the copy. The write throughput to the logs
> varies quite a bit; that sample is a very high mark, it mainly
> alternates between almost zero and 30M each, which is kind of odd
> considering the copy speed (using gigabit network, copy speed averages
> around 110MB/s).

The client is using async writes, that include commits. Sync writes do not
need commits.

What happens is that the ZFS transaction group commit occurs at more-or-less
regular intervals, likely 5 seconds for more modern ZFS systems. When the
commit occurs, any data that is in the ARC but not committed in a prior
transaction group gets sent to the ZIL. This is why you might see a very
different amount of ZIL activity relative to the expected write workload.

> When I 'zfs set sync=disabled', the output of nfssvrtop stays about
> the same, except that Com_t stays 0, and the log devices also stay at 0
> throughput. Could you enlighten me as to what "Com_t" measures
> when "Commits" stays zero? Perhaps the nfs server caches asynchronous
> nfs writes as I expect, but flushes its cache with synchronous
> writes?

With sync=disabled, the ZIL is not used, thus the commit response to the
client is a lie, breaking the covenant between the server and client. In
other words, the server is supposed to respond to the commit only when the
data is written to permanent media, but the administrator overruled this
action by disabling the ZIL. If the server were to unexpectedly restart, or
other conditions occur such that the write cannot be completed, then the
server and client will have different views of the data, a form of data
loss.

Different applications can react to long commit times differently. In this
example, we see 1.9 seconds for the commit versus about 400 microseconds for
each async write. The cause of the latency of the commit is not apparent
from any bandwidth measurements (e.g. zpool iostat) and you should consider
looking more closely at the "iostat -x" latency to see if the log devices
are performing well.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
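As a small sketch of that latency check, narrowed to the two log devices
named earlier in the thread (the -n flag prints c#t#d# names instead of
sd## instance numbers, and asvc_t is the average service time in
milliseconds; the grep pattern is illustrative):

    iostat -xn 5 | egrep 'device|c31t3d0|c31t4d0'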
Robert Milkowski
2012-Jun-14 20:35 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
> The client is using async writes, that include commits. Sync writes do
> not need commits.
>
> What happens is that the ZFS transaction group commit occurs at more-
> or-less regular intervals, likely 5 seconds for more modern ZFS
> systems. When the commit occurs, any data that is in the ARC but not
> committed in a prior transaction group gets sent to the ZIL

Are you sure? I don't think this is the case, unless I misunderstood you or
this is some recent change to Illumos.

Whatever is being committed when a zfs txg closes goes directly to the pool
and not to the zil. Only sync writes will go to the zil right away (and not
always, see logbias, etc.) and to the arc to be committed later to the pool
when the txg closes.

--
Robert Milkowski
http://milek.blogspot.com
Timothy Coalson
2012-Jun-14 22:15 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
> The client is using async writes, that include commits. Sync writes do not
> need commits.

Are you saying nfs commit operations sent by the client aren't always
reported by that script?

> What happens is that the ZFS transaction group commit occurs at more-or-less
> regular intervals, likely 5 seconds for more modern ZFS systems. When the
> commit occurs, any data that is in the ARC but not committed in a prior
> transaction group gets sent to the ZIL. This is why you might see a very
> different amount of ZIL activity relative to the expected write workload.

I thought this was the other way around: synchronous writes immediately get
copied to ZIL blocks (logbias may cause them to be allocated from the data
disks despite the existence of a separate log device) in order to not
require that a txg go through immediately to service them, then a txg that
includes the same data eventually goes directly to the data disks, and then
they get removed from the ZIL.

> With sync=disabled, the ZIL is not used, thus the commit response to the
> client is a lie, breaking the covenant between the server and client.

I was not clear: the behavior with sync=disabled was exactly what I expected
it to be; my question was why the Com_t attribute was nonzero with
sync=standard, while Commits stayed zero.

To be clear, while it was in the middle of copying data, the "Commits"
column always read zero, never even one commit reported (in fact, Commits
was always zero, even when a copy completed). I would expect this to mean
the nfs client was issuing asynchronous writes without a single commit
request (eventually filling the server's cache and requiring the server to
do something to resolve it). I am copying a single large file, 8GB, and
usually interrupting it before it finishes, after I have some data that
matches the period where it is writing to the log devices (it takes a few
seconds to get through the client and server caches and start using the
disks).

> Different applications can react to long commit times differently. In this
> example, we see 1.9 seconds for the commit versus about 400 microseconds
> for each async write. The cause of the latency of the commit is not
> apparent from any bandwidth measurements (e.g. zpool iostat) and you
> should consider looking more closely at the "iostat -x" latency to see if
> the log devices are performing well.

I am still unclear on the commit time being reported for zero reported
commit requests. Here is what I think are the log devices (I am unsure how
to translate the sd## mapping to solaris style controller/target devices)
under iostat -x 5 when they are being written to heavily:

device    r/s     w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
sd30      0.0  1073.1    0.0  72940.6   0.0   7.6    7.1   1  77
sd31      0.0  1075.3    0.0  73090.2   0.0   7.6    7.1   1  77

Considering their maximum write speed is reported as 90MB/s, and they each
wrote 70MB/s in that interval, I do not expect the svc_t to reflect the seek
time they can achieve. There were no particularly suspicious spikes in svc_t
on any device while Com_t was large.

I have a guess that your script is reporting the time taken to flush a cache
that got filled with async writes, and needed to be emptied to be able to
continue to service async writes (a lot of data, therefore a lot of time).
I'm wondering if this belongs in nfs-discuss; if the nfs server issues
synchronous writes to empty its cache (without receiving an nfs commit
request), it seems to explain everything.

Tim
Phil Harman
2012-Jun-15 03:30 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:

>> The client is using async writes, that include commits. Sync writes do not
>> need commits.
>
> Are you saying nfs commit operations sent by the client aren't always
> reported by that script?

They are not reported in your case because the commit rate is less than one
per second.

DTrace is an amazing tool, but it does dictate certain coding compromises,
particularly when it comes to output scaling, grouping, sorting and
formatting.

In this script the commit rate is calculated using integer division. In your
case the sample interval is 5 seconds, so up to 4 commits in the interval
will be reported as a big fat zero.

If you use a sample interval of 1 second you should see occasional commits.
We know they are there because we see a non-zero commit time.
Timothy Coalson
2012-Jun-15 04:20 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Indeed they are there, shown with a 1 second interval. So, it is the
client's fault after all. I'll have to see whether it is somehow possible to
get the server to write cached data sooner (and hopefully asynchronously),
and the client to issue commits less often. Luckily I can live with the
current behavior (and the SSDs shouldn't give out any time soon even being
used like this), if it isn't possible to change it.

Thanks for all the help,
Tim

On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman <phil.harman at gmail.com> wrote:
> On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:
>
>>> The client is using async writes, that include commits. Sync writes do not
>>> need commits.
>>
>> Are you saying nfs commit operations sent by the client aren't always
>> reported by that script?
>
> They are not reported in your case because the commit rate is less than one
> per second.
>
> DTrace is an amazing tool, but it does dictate certain coding compromises,
> particularly when it comes to output scaling, grouping, sorting and
> formatting.
>
> In this script the commit rate is calculated using integer division. In
> your case the sample interval is 5 seconds, so up to 4 commits in the
> interval will be reported as a big fat zero.
>
> If you use a sample interval of 1 second you should see occasional
> commits. We know they are there because we see a non-zero commit time.
Richard Elling
2012-Jun-15 15:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
[Phil beat me to it]
Yes, the 0s are a result of integer division in DTrace/kernel.

On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:

> Indeed they are there, shown with a 1 second interval. So, it is the
> client's fault after all. I'll have to see whether it is somehow
> possible to get the server to write cached data sooner (and hopefully
> asynchronously), and the client to issue commits less often. Luckily I
> can live with the current behavior (and the SSDs shouldn't give out
> any time soon even being used like this), if it isn't possible to
> change it.

If this is the proposed workload, then it is possible to tune the DMU to
manage commits more efficiently. In an ideal world, it does this
automatically, but the algorithms are based on a bandwidth calculation and
those are not suitable for HDD capacity planning. The efficiency goal would
be to do less work, more often, and there are two tunables that can apply:

1. the txg_timeout controls the default maximum transaction group commit
interval and is set to 5 seconds on modern ZFS implementations.

2. the zfs_write_limit is a size limit for txg commit. The idea is that a
txg will be committed when the size reaches this limit, rather than waiting
for the txg_timeout. For streaming writes, this can work better than tuning
the txg_timeout.

 -- richard

> Thanks for all the help,
> Tim
>
> On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman <phil.harman at gmail.com> wrote:
>> On 14 Jun 2012, at 23:15, Timothy Coalson <tsc5yc at mst.edu> wrote:
>>
>>>> The client is using async writes, that include commits. Sync writes do not
>>>> need commits.
>>>
>>> Are you saying nfs commit operations sent by the client aren't always
>>> reported by that script?
>>
>> They are not reported in your case because the commit rate is less than
>> one per second.
>>
>> DTrace is an amazing tool, but it does dictate certain coding compromises,
>> particularly when it comes to output scaling, grouping, sorting and
>> formatting.
>>
>> In this script the commit rate is calculated using integer division. In
>> your case the sample interval is 5 seconds, so up to 4 commits in the
>> interval will be reported as a big fat zero.
>>
>> If you use a sample interval of 1 second you should see occasional
>> commits. We know they are there because we see a non-zero commit time.

--
ZFS and performance consulting
http://www.RichardElling.com
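A hedged sketch of applying those two tunables on an illumos-era server; the
variable names (zfs_txg_timeout, zfs_write_limit_override) and the mdb write
syntax are assumptions about that vintage of ZFS, and the values are purely
illustrative:

    # shorten the txg commit interval on the running kernel (seconds)
    echo "zfs_txg_timeout/W 0t2" | mdb -kw

    # cap the amount of data per txg (bytes); here roughly 256MB
    echo "zfs_write_limit_override/Z 0t268435456" | mdb -kw

    # rough /etc/system equivalents, to persist across reboots:
    #   set zfs:zfs_txg_timeout = 2
    #   set zfs:zfs_write_limit_override = 268435456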
Richard Elling
2012-Jun-15 15:57 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 14, 2012, at 1:35 PM, Robert Milkowski wrote:

>> The client is using async writes, that include commits. Sync writes do
>> not need commits.
>>
>> What happens is that the ZFS transaction group commit occurs at more-
>> or-less regular intervals, likely 5 seconds for more modern ZFS
>> systems. When the commit occurs, any data that is in the ARC but not
>> committed in a prior transaction group gets sent to the ZIL
>
> Are you sure? I don't think this is the case, unless I misunderstood you
> or this is some recent change to Illumos.

Need to make sure we are clear here: there is time between the txg being
closed and the txg being on disk. During that period, a sync write of the
data in the closed txg is written to the ZIL.

> Whatever is being committed when a zfs txg closes goes directly to the
> pool and not to the zil. Only sync writes will go to the zil right away
> (and not always, see logbias, etc.) and to the arc to be committed later
> to the pool when the txg closes.

In this specific case, there are separate log devices, so logbias doesn't
apply.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
Timothy Coalson
2012-Jun-15 17:56 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
Thanks for the suggestions. I think it would also depend on whether the nfs
server has tried to write asynchronously to the pool in the meantime, which
I am unsure how to test, other than making the txgs extremely frequent and
watching the load on the log devices.

As for the integer division giving misleading zeros, one possible solution
is to add (delay-1) to the count before dividing by delay, so if there are
any, it will show at least 1 (or you could get fancy and do fixed-point
numbers).

As for very frequent txgs, I imagine this could cause more fragmentation
(more metadata written and discarded more frequently); is there a way to
estimate or test for the impact of it? Depending on how it allocates the
metadata blocks, I suppose it could write them to the blocks recently
vacated by old metadata from the previous txg, and have almost no impact
until a snapshot is taken. Is it smart enough to do this?

Tim

On Fri, Jun 15, 2012 at 10:56 AM, Richard Elling <richard.elling at gmail.com> wrote:
> [Phil beat me to it]
> Yes, the 0s are a result of integer division in DTrace/kernel.
>
> On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:
>
>> Indeed they are there, shown with a 1 second interval. So, it is the
>> client's fault after all. I'll have to see whether it is somehow
>> possible to get the server to write cached data sooner (and hopefully
>> asynchronously), and the client to issue commits less often. Luckily I
>> can live with the current behavior (and the SSDs shouldn't give out
>> any time soon even being used like this), if it isn't possible to
>> change it.
>
> If this is the proposed workload, then it is possible to tune the DMU to
> manage commits more efficiently. In an ideal world, it does this
> automatically, but the algorithms are based on a bandwidth calculation and
> those are not suitable for HDD capacity planning. The efficiency goal
> would be to do less work, more often, and there are two tunables that can
> apply:
>
> 1. the txg_timeout controls the default maximum transaction group commit
> interval and is set to 5 seconds on modern ZFS implementations.
>
> 2. the zfs_write_limit is a size limit for txg commit. The idea is that a
> txg will be committed when the size reaches this limit, rather than
> waiting for the txg_timeout. For streaming writes, this can work better
> than tuning the txg_timeout.
>
>  -- richard
>
> --
> ZFS and performance consulting
> http://www.RichardElling.com
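A tiny sketch to make the rounding idea concrete (plain shell; the variable
names are made up rather than taken from the nfssvrtop source):

    #!/bin/sh
    # 4 commits observed over a 5-second sample interval
    commits=4
    delay=5

    # plain integer division hides the activity entirely
    echo "commits/delay           = $(( commits / delay ))"                 # prints 0

    # adding (delay-1) first rounds up, so any activity shows as at least 1
    echo "(commits+delay-1)/delay = $(( (commits + delay - 1) / delay ))"   # prints 1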
Timothy Coalson
2012-Jun-15 22:04 UTC
[zfs-discuss] NFS asynchronous writes being written to ZIL
On Fri, Jun 15, 2012 at 12:56 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:
> Thanks for the suggestions. I think it would also depend on whether
> the nfs server has tried to write asynchronously to the pool in the
> meantime, which I am unsure how to test, other than making the txgs
> extremely frequent and watching the load on the log devices.

I didn't want to reboot the main file server to test this, so I used zilstat
on the backup nfs server (which has nearly identical hardware and
configuration, but doesn't have SSDs for a separate ZIL) to see if I could
estimate the difference it would make, and the story got stranger: it wrote
far less data to the ZIL for the same copy operation (single 8GB file):

$ sudo ./zilstat -M -l 20 -p backuppool txg
waiting for txg commit...
      txg   N-MB  N-MB/s  N-Max-Rate   B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2833307      1       0           1      1       0           1     15      0       0      15
  2833308      0       0           0      0       0           0      0      0       0       0
  2833309      1       0           1      1       0           1      8      0       0       8
  2833310      0       0           0      0       0           0      4      0       0       4
  2833311      1       0           0      1       0           0      9      0       0       9
  2833312      0       0           0      0       0           0      0      0       0       0
  2833313      2       0           2      2       0           2     21      0       0      21
  2833314      7       1           7      8       1           8     63      0       0      63
  2833315      1       0           1      2       0           2     18      0       0      18
  2833316      0       0           0      0       0           0      5      0       0       5

A small sample from the server with SSD log devices doing the same
operation:

$ sudo ./zilstat -M -l 20 -p mainpool txg
waiting for txg commit...
      txg   N-MB  N-MB/s  N-Max-Rate   B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2808483    989     197         593   1967     393        1180  15010      0       0   15010
  2808484    599      99         208   1134     189         393   8653      0       0    8653
  2808485      0       0           0      0       0           0      0      0       0       0
  2808486    137      27         126    255      51         235   1953      0       0    1953
  2808487    460      92         460    859     171         859   6555      0       0    6555
  2808488    530      75         530   1031     147        1031   7871      0       0    7871

Setting logbias=throughput makes the server with the SSD log devices act the
same as the server without them, as far as I can tell, which I somewhat
expected. However, I did not expect use of separate log devices to change
how often ZIL ops are performed, other than to raise the upper limit if the
device can service more IOPS. Additionally, nfssvrtop showed a lower value
for Com_t when not using the separate log device (2.1s with logbias=latency,
0.24s with throughput).

Copying a folder with small files and subdirectories pushes the server to
~400 ZIL ops per txg with logbias=throughput, so it shouldn't be the device
performance making it issue only ~15 ops per txg when copying a large file
without using a separate log device. I am thinking of transplanting one of
the SSDs temporarily for testing, but I would be interested to know the
cause of this behavior. I don't know why more asynchronous writes seem to be
making it into txgs without being caught by an nfs commit when a separate
log device isn't used.

Tim
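For completeness, a small sketch of the logbias comparison described above
(dataset name from earlier in the thread; latency and throughput are the two
standard property values, and the revert step is included):

    # default: ZIL writes go to the separate log devices
    zfs set logbias=latency mainpool/storage
    # ...run the copy, watch zilstat / nfssvrtop...

    # bypass the slog and allocate ZIL blocks from the pool disks instead
    zfs set logbias=throughput mainpool/storage
    # ...run the copy again, then revert to the inherited setting
    zfs inherit logbias mainpool/storage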