While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted for. What dd reports amounts to almost exactly half of what zpool iostat or iostat show, even after accounting for the overhead of the two mirrored vdevs. Would anyone care to guess where it may be going?

(This is measured over 10 second intervals. For 1 second intervals, the bandwidth to the disks jumps around from <40MB/s to >240MB/s.)

With a local dd, everything adds up. This is with a b41 server and a Mac OS X 10.4 NFS client. I have verified that the bandwidth at the network interface is approximately that reported by dd, so the issue would appear to be within the server.

Any suggestions would be welcome.

Chris
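P.S. For reference, the numbers above were gathered roughly as follows; the mount point, pool name, and file are just placeholders for my setup:

    # On the Mac OS X client: stream a large file over the NFS mount and
    # note the rate dd reports when it completes.
    dd if=/dev/zero of=/Volumes/tank/ddtest bs=1m count=4096

    # On the Solaris server: watch pool and disk bandwidth over 10 second
    # intervals and compare with the dd figure.
    zpool iostat tank 10
    iostat -xnz 10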
Chris,

The data will be written twice on ZFS using NFS. This is because NFS on closing the file internally uses fsync to cause the writes to be committed. This causes the ZIL to immediately write the data to the intent log. Later the data is also committed as part of the pool's transaction group commit, at which point the intent log blocks are freed.

It does seem inefficient to doubly write the data. In fact, for blocks larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed) we write the data block directly and also an intent log record with the block pointer. During txg commit we link this block into the pool tree. By experimentation we found 32K to be the (current) cutoff point. As the nfsds write at most 32K, they do not benefit from this.

Anyway, this is an area we are actively working on.

Neil.

Chris Csanady wrote On 06/23/06 23:45:
> While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted
> for. What dd reports amounts to almost exactly half of what zpool iostat
> or iostat show, even after accounting for the overhead of the two mirrored
> vdevs. Would anyone care to guess where it may be going?
>
> (This is measured over 10 second intervals. For 1 second intervals,
> the bandwidth to the disks jumps around from <40MB/s to >240MB/s.)
>
> With a local dd, everything adds up. This is with a b41 server and a
> Mac OS X 10.4 NFS client. I have verified that the bandwidth at the network
> interface is approximately that reported by dd, so the issue would appear
> to be within the server.
>
> Any suggestions would be welcome.
>
> Chris

--
Neil
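P.S. If you want to watch the intent log activity directly while the dd is running, a DTrace sketch along these lines (run as root on the server; purely illustrative) counts how often zil_commit, the routine fsync drives, is being called:

    # Count zil_commit() calls in 10 second buckets while the NFS writes
    # are in flight; a steady stream of them shows the fsync-driven
    # intent log path is being exercised.
    dtrace -n 'fbt::zil_commit:entry { @ = count(); }
               tick-10sec { printa(@); clear(@); }'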
On 6/24/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> The data will be written twice on ZFS using NFS. This is because NFS
> on closing the file internally uses fsync to cause the writes to be
> committed. This causes the ZIL to immediately write the data to the
> intent log. Later the data is also committed as part of the pool's
> transaction group commit, at which point the intent log blocks are freed.

In this case though, the file is left open, so there should be no synchronous I/O. (tcpdump -vv confirms that all writes are marked as unstable, and there are no commits.) Perhaps the NFS server is issuing the I/O synchronously when it should not be, thus causing the double writes?

> It does seem inefficient to doubly write the data. In fact, for blocks
> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
> fixed) we write the data block directly and also an intent log record with
> the block pointer. During txg commit we link this block into the pool tree.
> By experimentation we found 32K to be the (current) cutoff point. As the
> nfsds write at most 32K, they do not benefit from this.

That seems like an interesting coincidence, though perhaps it may not be the same issue after all. While the disparity does not afflict UFS, maybe it has merely gone unnoticed due to the nature of the logging. It seems possible that it is entirely an NFS server issue.

> Anyway, this is an area we are actively working on.

This is good to know; the synchronous I/O performance is kind of painful at the moment. Not that it would be a problem, but disk-image-based filesystems in Mac OS X appear to do *all* I/O to their backing store synchronously; as you might imagine, the performance is spectacularly bad over NFS!

Anyway, thanks Neil. I apologize if this is in fact unrelated to ZFS.

Chris
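P.S. For reference, the check was just something like this (the interface name is a placeholder), looking at whether each WRITE call is flagged stable and whether any COMMIT calls follow:

    # Decode the NFS traffic between client and server and inspect the
    # stability flag on the WRITE calls; during the dd no COMMITs appeared.
    tcpdump -vv -s 0 -i en0 port 2049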
Robert Milkowski
2006-Jun-25 10:12 UTC
[zfs-discuss] Bandwidth disparity between NFS and ZFS
Hello Neil,

Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP> Chris,

NP> The data will be written twice on ZFS using NFS. This is because NFS
NP> on closing the file internally uses fsync to cause the writes to be
NP> committed. This causes the ZIL to immediately write the data to the
NP> intent log. Later the data is also committed as part of the pool's
NP> transaction group commit, at which point the intent log blocks are freed.

NP> It does seem inefficient to doubly write the data. In fact, for blocks
NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
NP> fixed) we write the data block directly and also an intent log record with
NP> the block pointer. During txg commit we link this block into the pool tree.
NP> By experimentation we found 32K to be the (current) cutoff point. As the
NP> nfsds write at most 32K, they do not benefit from this.

Is 32KB easily tuned (mdb?)? I guess not, but perhaps.

And why only for blocks larger than zfs_immediate_write_sz?

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Robert Milkowski wrote On 06/25/06 04:12:
> Hello Neil,
>
> Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>
> NP> Chris,
>
> NP> The data will be written twice on ZFS using NFS. This is because NFS
> NP> on closing the file internally uses fsync to cause the writes to be
> NP> committed. This causes the ZIL to immediately write the data to the
> NP> intent log. Later the data is also committed as part of the pool's
> NP> transaction group commit, at which point the intent log blocks are freed.
>
> NP> It does seem inefficient to doubly write the data. In fact, for blocks
> NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was
> NP> fixed) we write the data block directly and also an intent log record with
> NP> the block pointer. During txg commit we link this block into the pool tree.
> NP> By experimentation we found 32K to be the (current) cutoff point. As the
> NP> nfsds write at most 32K, they do not benefit from this.
>
> Is 32KB easily tuned (mdb?)?

I'm not sure. NFS folk?

> I guess not, but perhaps.
>
> And why only for blocks larger than zfs_immediate_write_sz?

When data is large enough (currently >32K) it's more efficient to directly write the block, and additionally save the block pointer in a ZIL record. Otherwise it's more efficient to copy the data into a large log block, potentially along with other writes.

--
Neil
On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> Robert Milkowski wrote On 06/25/06 04:12:
> > Hello Neil,
> >
> > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
> >
> > NP> Chris,
> >
> > NP> The data will be written twice on ZFS using NFS. This is because NFS
> > NP> on closing the file internally uses fsync to cause the writes to be
> > NP> committed. This causes the ZIL to immediately write the data to the
> > NP> intent log. Later the data is also committed as part of the pool's
> > NP> transaction group commit, at which point the intent log blocks are freed.
> >
> > NP> It does seem inefficient to doubly write the data. In fact, for blocks
> > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
> > NP> was fixed) we write the data block directly and also an intent log record
> > NP> with the block pointer. During txg commit we link this block into the pool
> > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
> > NP> As the nfsds write at most 32K, they do not benefit from this.
> >
> > Is 32KB easily tuned (mdb?)?
>
> I'm not sure. NFS folk?

I think he is referring to the zfs_immediate_write_sz variable, but NFS will support larger block sizes as well. Unfortunately, since the maximum IP datagram size is 64k, after headers are taken into account, the largest useful value is 60k. If this is to be laid out as an indirect write, will it be written as 32k+16k+8k+4k blocks? If so, this seems like it would be quite inefficient for RAID-Z, and writes would best be left at 32k.

Chris
Robert Milkowski
2006-Jun-27 09:00 UTC
[zfs-discuss] Bandwidth disparity between NFS and ZFS
Hello Chris,

Tuesday, June 27, 2006, 1:07:31 AM, you wrote:

CC> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>
>> Robert Milkowski wrote On 06/25/06 04:12:
>> > Hello Neil,
>> >
>> > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>> >
>> > NP> Chris,
>> >
>> > NP> The data will be written twice on ZFS using NFS. This is because NFS
>> > NP> on closing the file internally uses fsync to cause the writes to be
>> > NP> committed. This causes the ZIL to immediately write the data to the
>> > NP> intent log. Later the data is also committed as part of the pool's
>> > NP> transaction group commit, at which point the intent log blocks are freed.
>> >
>> > NP> It does seem inefficient to doubly write the data. In fact, for blocks
>> > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
>> > NP> was fixed) we write the data block directly and also an intent log record
>> > NP> with the block pointer. During txg commit we link this block into the pool
>> > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
>> > NP> As the nfsds write at most 32K, they do not benefit from this.
>> >
>> > Is 32KB easily tuned (mdb?)?
>>
>> I'm not sure. NFS folk?

CC> I think he is referring to the zfs_immediate_write_sz variable, but

Exactly, I was asking about this, not NFS.

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Chris Csanady writes:
> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
> >
> > Robert Milkowski wrote On 06/25/06 04:12:
> > > Hello Neil,
> > >
> > > Saturday, June 24, 2006, 3:46:34 PM, you wrote:
> > >
> > > NP> Chris,
> > >
> > > NP> The data will be written twice on ZFS using NFS. This is because NFS
> > > NP> on closing the file internally uses fsync to cause the writes to be
> > > NP> committed. This causes the ZIL to immediately write the data to the
> > > NP> intent log. Later the data is also committed as part of the pool's
> > > NP> transaction group commit, at which point the intent log blocks are freed.
> > >
> > > NP> It does seem inefficient to doubly write the data. In fact, for blocks
> > > NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
> > > NP> was fixed) we write the data block directly and also an intent log record
> > > NP> with the block pointer. During txg commit we link this block into the pool
> > > NP> tree. By experimentation we found 32K to be the (current) cutoff point.
> > > NP> As the nfsds write at most 32K, they do not benefit from this.
> > >
> > > Is 32KB easily tuned (mdb?)?
> >
> > I'm not sure. NFS folk?
>
> I think he is referring to the zfs_immediate_write_sz variable, but NFS
> will support larger block sizes as well. Unfortunately, since the maximum
> IP datagram size is 64k, after headers are taken into account, the largest
> useful value is 60k. If this is to be laid out as an indirect write, will
> it be written as 32k+16k+8k+4k blocks? If so, this seems like it would be
> quite inefficient for RAID-Z, and writes would best be left at 32k.
>
> Chris

I think the 64K issue refers to UDP; that limits the maximum block size NFS may use. But with TCP mounts, NFS is not bounded by this. It should be possible to adjust the NFS block size up. For this I think you need to adjust nfs4_bsize on the client:

    echo "nfs4_bsize/W131072" | mdb -kw

And it could also help to tune up the transfer size:

    echo "nfs4_max_transfer_size/W131072" | mdb -kw

I also wonder if general purpose NFS exports should not have their recordsize set to 32K in order to match the default NFS bsize. But I have not really looked at this perf yet.

-r
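P.S. On the recordsize side, that would be something like the following (the dataset name is just an example); note that a recordsize change only affects files created afterwards:

    # Match the dataset's recordsize to the 32K NFS write size.
    zfs set recordsize=32k tank/export
    zfs get recordsize tank/export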
Robert Milkowski wrote On 06/27/06 03:00:
> Hello Chris,
>
> Tuesday, June 27, 2006, 1:07:31 AM, you wrote:
>
> CC> On 6/26/06, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
>>> Robert Milkowski wrote On 06/25/06 04:12:
>>>
>>>> Hello Neil,
>>>>
>>>> Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>>>>
>>>> NP> Chris,
>>>>
>>>> NP> The data will be written twice on ZFS using NFS. This is because NFS
>>>> NP> on closing the file internally uses fsync to cause the writes to be
>>>> NP> committed. This causes the ZIL to immediately write the data to the
>>>> NP> intent log. Later the data is also committed as part of the pool's
>>>> NP> transaction group commit, at which point the intent log blocks are freed.
>>>>
>>>> NP> It does seem inefficient to doubly write the data. In fact, for blocks
>>>> NP> larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499
>>>> NP> was fixed) we write the data block directly and also an intent log record
>>>> NP> with the block pointer. During txg commit we link this block into the pool
>>>> NP> tree. By experimentation we found 32K to be the (current) cutoff point.
>>>> NP> As the nfsds write at most 32K, they do not benefit from this.
>>>>
>>>> Is 32KB easily tuned (mdb?)?
>>>
>>> I'm not sure. NFS folk?
>
> CC> I think he is referring to the zfs_immediate_write_sz variable, but
>
> Exactly, I was asking about this, not NFS.

Sorry for the confusion. The zfs_immediate_write_sz variable was meant for internal use and not really intended for public tuning. However, yes, it could be tuned dynamically at any time using mdb, or set in /etc/system.

--
Neil
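P.S. For the record, the mechanics would look roughly like this; the value is only to illustrate the syntax, and this assumes a 64-bit kernel where the variable is 8 bytes wide:

    # Read the current value:
    echo "zfs_immediate_write_sz/J" | mdb -k

    # Change it at runtime (/Z writes 8 bytes; /W would only write 4):
    echo "zfs_immediate_write_sz/Z 0x4000" | mdb -kw

    # Or persistently, via a line in /etc/system:
    set zfs:zfs_immediate_write_sz = 0x4000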