While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted for. What dd reports amounts to almost exactly half of what zpool iostat or iostat show, even after accounting for the overhead of the two mirrored vdevs. Would anyone care to guess where it may be going?

(This is measured over 10 second intervals. For 1 second intervals, the bandwidth to the disks jumps around from <40MB/s to >240MB/s.)

With a local dd, everything adds up. This is with a b41 server and a Mac OS X 10.4 NFS client. I have verified that the bandwidth at the network interface is approximately that reported by dd, so the issue would appear to be within the server.

Any suggestions would be welcome.

Chris
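P.S. For reference, the numbers above were gathered roughly as follows; the mount point, pool name, and file are just placeholders for my setup:

    # On the Mac OS X client: stream a large file over the NFS mount and
    # note the rate dd reports when it completes.
    dd if=/dev/zero of=/Volumes/tank/ddtest bs=1m count=4096

    # On the Solaris server: watch pool and disk bandwidth over 10 second
    # intervals and compare with the dd figure.
    zpool iostat tank 10
    iostat -xnz 10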
Chris,

The data will be written twice on ZFS using NFS. This is because NFS on closing the file internally uses fsync to cause the writes to be committed. This causes the ZIL to immediately write the data to the intent log. Later the data is also committed as part of the pool's transaction group commit, at which point the intent log blocks are freed.

It does seem inefficient to doubly write the data. In fact, for blocks larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed) we write the data block directly and also an intent log record with the block pointer. During txg commit we link this block into the pool tree. By experimentation we found 32K to be the (current) cutoff point. As the nfsds write at most 32K, they do not benefit from this.

Anyway, this is an area we are actively working on.

Neil.

Chris Csanady wrote On 06/23/06 23:45:
> While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted
> for. What dd reports amounts to almost exactly half of what zpool iostat
> or iostat show, even after accounting for the overhead of the two mirrored
> vdevs. Would anyone care to guess where it may be going?
>
> (This is measured over 10 second intervals. For 1 second intervals,
> the bandwidth to the disks jumps around from <40MB/s to >240MB/s.)
>
> With a local dd, everything adds up. This is with a b41 server and a
> Mac OS X 10.4 NFS client. I have verified that the bandwidth at the network
> interface is approximately that reported by dd, so the issue would appear
> to be within the server.
>
> Any suggestions would be welcome.
>
> Chris

--
Neil
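P.S. If you want to watch the intent log activity directly while the dd is running, a DTrace sketch along these lines (run as root on the server; purely illustrative) counts how often zil_commit, the routine fsync drives, is being called:

    # Count zil_commit() calls in 10 second buckets while the NFS writes
    # are in flight; a steady stream of them shows the fsync-driven
    # intent log path is being exercised.
    dtrace -n 'fbt::zil_commit:entry { @ = count(); }
               tick-10sec { printa(@); clear(@); }'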
On 6/24/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> The data will be written twice on ZFS using NFS. This is because NFS
> on closing the file internally uses fsync to cause the writes to be
> committed. This causes the ZIL to immediately write the data to the
> intent log. Later the data is also committed as part of the pool's
> transaction group commit, at which point the intent log blocks are freed.

In this case though, the file is left open, so there should be no synchronous I/O. (tcpdump -vv confirms that all writes are marked as unstable, and there are no commits.) Perhaps the NFS server is issuing the I/O synchronously when it should not be, thus causing the double writes?

> It does seem inefficient to doubly write the data. In fact, for blocks
> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
> fixed) we write the data block directly and also an intent log record with
> the block pointer. During txg commit we link this block into the pool tree.
> By experimentation we found 32K to be the (current) cutoff point. As the
> nfsds write at most 32K, they do not benefit from this.

That seems like an interesting coincidence, though perhaps it may not be the same issue after all. While the disparity does not afflict UFS, maybe it has merely gone unnoticed due to the nature of the logging. It seems possible that it is entirely an NFS server issue.

> Anyway, this is an area we are actively working on.

This is good to know; the synchronous I/O performance is kind of painful at the moment. Not that it would be a problem, but disk-image-based filesystems in Mac OS X appear to do *all* I/O to their backing store synchronously; as you might imagine, the performance is spectacularly bad over NFS!

Anyway, thanks Neil. I apologize if this is in fact unrelated to ZFS.

Chris
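P.S. For reference, the check was just something like this (the interface name is a placeholder), looking at whether each WRITE call is flagged stable and whether any COMMIT calls follow:

    # Decode the NFS traffic between client and server and inspect the
    # stability flag on the WRITE calls; during the dd no COMMITs appeared.
    tcpdump -vv -s 0 -i en0 port 2049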
Robert Milkowski
2006-Jun-25 10:12 UTC
[zfs-discuss] Bandwidth disparity between NFS and ZFS
Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP> Chris,

NP> The data will be written twice on ZFS using NFS. This is because NFS
NP> on closing the file internally uses fsync to cause the writes to be
NP> committed. This causes the ZIL to immediately write the data to the
NP> intent log. Later the data is also committed as part of the pool's
NP> transaction group commit, at which point the intent log blocks are freed.

NP> It does seem inefficient to doubly write the data. In fact, for blocks
NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
NP> fixed) we write the data block directly and also an intent log record with
NP> the block pointer. During txg commit we link this block into the pool tree.
NP> By experimentation we found 32K to be the (current) cutoff point. As the
NP> nfsds write at most 32K, they do not benefit from this.

Is 32KB easily tuned (mdb?)? I guess not, but perhaps.

And why only for blocks larger than zfs_immediate_write_sz?

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Robert Milkowski wrote On 06/25/06 04:12:
> Hello Neil,
>
> Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>
> NP> Chris,
>
> NP> The data will be written twice on ZFS using NFS. This is because NFS
> NP> on closing the file internally uses fsync to cause the writes to be
> NP> committed. This causes the ZIL to immediately write the data to the
> NP> intent log. Later the data is also committed as part of the pool's
> NP> transaction group commit, at which point the intent log blocks are freed.
>
> NP> It does seem inefficient to doubly write the data. In fact, for blocks
> NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
> NP> fixed) we write the data block directly and also an intent log record with
> NP> the block pointer. During txg commit we link this block into the pool tree.
> NP> By experimentation we found 32K to be the (current) cutoff point. As the
> NP> nfsds write at most 32K, they do not benefit from this.
>
> Is 32KB easily tuned (mdb?)?

I'm not sure. NFS folk?

> I guess not, but perhaps.
>
> And why only for blocks larger than zfs_immediate_write_sz?

When data is large enough (currently >32K) it's more efficient to directly write the block, and additionally save the block pointer in a ZIL record. Otherwise it's more efficient to copy the data into a large log block, potentially along with other writes.

--
Neil
On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> Robert Milkowski wrote On 06/25/06 04:12:
> > Hello Neil,
> >
> > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
> >
> > NP> Chris,
> >
> > NP> The data will be written twice on ZFS using NFS. This is because NFS
> > NP> on closing the file internally uses fsync to cause the writes to be
> > NP> committed. This causes the ZIL to immediately write the data to the
> > NP> intent log. Later the data is also committed as part of the pool's
> > NP> transaction group commit, at which point the intent log blocks are freed.
> >
> > NP> It does seem inefficient to doubly write the data. In fact, for blocks
> > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
> > NP> was fixed) we write the data block directly and also an intent log record
> > NP> with the block pointer. During txg commit we link this block into the pool
> > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
> > NP> As the nfsds write at most 32K, they do not benefit from this.
> >
> > Is 32KB easily tuned (mdb?)?
>
> I'm not sure. NFS folk?

I think he is referring to the zfs_immediate_write_sz variable, but NFS will support larger block sizes as well. Unfortunately, since the maximum IP datagram size is 64k, after headers are taken into account, the largest useful value is 60k. If this is to be laid out as an indirect write, will it be written as 32k+16k+8k+4k blocks? If so, this seems like it would be quite inefficient for RAID-Z, and writes would best be left at 32k.

Chris
Robert Milkowski
2006-Jun-27 09:00 UTC
[zfs-discuss] Bandwidth disparity between NFS and ZFS
Hello Chris,

Tuesday, June 27, 2006, 1:07:31 AM, you wrote:

CC> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>
>> Robert Milkowski wrote On 06/25/06 04:12:
>> > Hello Neil,
>> >
>> > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>> >
>> > NP> Chris,
>> >
>> > NP> The data will be written twice on ZFS using NFS. This is because NFS
>> > NP> on closing the file internally uses fsync to cause the writes to be
>> > NP> committed. This causes the ZIL to immediately write the data to the
>> > NP> intent log. Later the data is also committed as part of the pool's
>> > NP> transaction group commit, at which point the intent log blocks are freed.
>> >
>> > NP> It does seem inefficient to doubly write the data. In fact, for blocks
>> > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
>> > NP> was fixed) we write the data block directly and also an intent log record
>> > NP> with the block pointer. During txg commit we link this block into the pool
>> > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
>> > NP> As the nfsds write at most 32K, they do not benefit from this.
>> >
>> > Is 32KB easily tuned (mdb?)?
>>
>> I'm not sure. NFS folk?

CC> I think he is referring to the zfs_immediate_write_sz variable, but

Exactly, I was asking about this, not NFS.

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Chris Csanady writes:
> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
> >
> > Robert Milkowski wrote On 06/25/06 04:12:
> > > Hello Neil,
> > >
> > > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
> > >
> > > NP> Chris,
> > >
> > > NP> The data will be written twice on ZFS using NFS. This is because NFS
> > > NP> on closing the file internally uses fsync to cause the writes to be
> > > NP> committed. This causes the ZIL to immediately write the data to the
> > > NP> intent log. Later the data is also committed as part of the pool's
> > > NP> transaction group commit, at which point the intent log blocks are freed.
> > >
> > > NP> It does seem inefficient to doubly write the data. In fact, for blocks
> > > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
> > > NP> was fixed) we write the data block directly and also an intent log record
> > > NP> with the block pointer. During txg commit we link this block into the pool
> > > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
> > > NP> As the nfsds write at most 32K, they do not benefit from this.
> > >
> > > Is 32KB easily tuned (mdb?)?
> >
> > I'm not sure. NFS folk?
>
> I think he is referring to the zfs_immediate_write_sz variable, but NFS
> will support larger block sizes as well. Unfortunately, since the maximum
> IP datagram size is 64k, after headers are taken into account, the largest
> useful value is 60k. If this is to be laid out as an indirect write, will
> it be written as 32k+16k+8k+4k blocks? If so, this seems like it would be
> quite inefficient for RAID-Z, and writes would best be left at 32k.
>
> Chris

I think the 64K issue refers to UDP; that limits the maximum block size NFS may use. But with TCP mounts, NFS is not bounded by this. It should be possible to adjust the NFS block size up. For this I think you need to adjust nfs4_bsize on the client:

    echo "nfs4_bsize/W131072" | mdb -kw

And it could also help to tune up the transfer size:

    echo "nfs4_max_transfer_size/W131072" | mdb -kw

I also wonder if general purpose NFS exports should not have their recordsize set to 32K in order to match the default NFS bsize. But I have not really looked at this perf yet.

-r
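P.S. On the recordsize side, that would be something like the following (the dataset name is just an example); note that a recordsize change only affects files created afterwards:

    # Match the dataset's recordsize to the 32K NFS write size.
    zfs set recordsize=32k tank/export
    zfs get recordsize tank/export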
Robert Milkowski wrote On 06/27/06 03:00:
> Hello Chris,
>
> Tuesday, June 27, 2006, 1:07:31 AM, you wrote:
>
> CC> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
>>> Robert Milkowski wrote On 06/25/06 04:12:
>>>
>>>> Hello Neil,
>>>>
>>>> Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>>>>
>>>> NP> Chris,
>>>>
>>>> NP> The data will be written twice on ZFS using NFS. This is because NFS
>>>> NP> on closing the file internally uses fsync to cause the writes to be
>>>> NP> committed. This causes the ZIL to immediately write the data to the
>>>> NP> intent log. Later the data is also committed as part of the pool's
>>>> NP> transaction group commit, at which point the intent log blocks are freed.
>>>>
>>>> NP> It does seem inefficient to doubly write the data. In fact, for blocks
>>>> NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
>>>> NP> was fixed) we write the data block directly and also an intent log record
>>>> NP> with the block pointer. During txg commit we link this block into the pool
>>>> NP> tree. By experimentation we found 32K to be the (current) cutoff point.
>>>> NP> As the nfsds write at most 32K, they do not benefit from this.
>>>>
>>>> Is 32KB easily tuned (mdb?)?
>>>
>>> I'm not sure. NFS folk?
>
> CC> I think he is referring to the zfs_immediate_write_sz variable, but
>
> Exactly, I was asking about this, not NFS.

Sorry for the confusion. The zfs_immediate_write_sz variable was meant for internal use and not really intended for public tuning. However, yes, it could be tuned dynamically at any time using mdb, or set in /etc/system.

--
Neil
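P.S. For the record, the mechanics would look roughly like this; the value is only to illustrate the syntax, and this assumes a 64-bit kernel where the variable is 8 bytes wide:

    # Read the current value:
    echo "zfs_immediate_write_sz/J" | mdb -k

    # Change it at runtime (/Z writes 8 bytes; /W would only write 4):
    echo "zfs_immediate_write_sz/Z 0x4000" | mdb -kw

    # Or persistently, via a line in /etc/system:
    set zfs:zfs_immediate_write_sz = 0x4000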