Thomas Nau
2013-Jan-16 14:17 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
Dear all,

I've a question concerning possible performance tuning for both iSCSI access
and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
SSDs and 128G of main memory.

The iSCSI access pattern (1 hour daytime average) looks like the following
(Thanks to Richard Elling for the dtrace script):

  R      value  ------------- Distribution ------------- count
         256 |                                           0
         512 |@                                          22980
        1024 |                                           663
        2048 |                                           1075
        4096 |@@@@@@@@@@@@@@@@@@@@@@@@@                  433819
        8192 |@@                                         40876
       16384 |@@                                         37218
       32768 |@@@@@                                      82584
       65536 |@@                                         34784
      131072 |@                                          25968
      262144 |@                                          14884
      524288 |                                           69
     1048576 |                                           0

  W      value  ------------- Distribution ------------- count
         256 |                                           0
         512 |@                                          35961
        1024 |                                           25108
        2048 |                                           10222
        4096 |@@@@@@@@@@@@@@@@@@@@@@@                    1243634
        8192 |@@@@@@@@@                                  521519
       16384 |@@@@                                       218932
       32768 |@@@                                        146519
       65536 |                                           112
      131072 |                                           15
      262144 |                                           78
      524288 |                                           0

For disaster recovery we plan to sync the pool as often as possible
to a remote location. Running send/receive after a day or so seems to take
a significant amount of time wading through all the blocks and we hardly
see network average traffic going over 45MB/s (almost idle 1G link).

So here's the question: would increasing/decreasing the volblocksize improve
the send/receive operation and what influence might show for the iSCSI side?

Thanks for any help
Thomas
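P.S. In case it matters, the sync is basically a snapshot followed by an
incremental send/receive; roughly like the following minimal sketch (pool,
dataset and host names are made up, and ssh is only an example transport):

    zfs snapshot tank/xen-vol01@2013-01-16
    zfs send -i tank/xen-vol01@2013-01-15 tank/xen-vol01@2013-01-16 | \
        ssh backuphost zfs receive -F backup/xen-vol01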
Bob Friesenhahn
2013-Jan-17 15:04 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Wed, 16 Jan 2013, Thomas Nau wrote:

> Dear all
> I've a question concerning possible performance tuning for both iSCSI access
> and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
> default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
> The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
> SSDs and 128G of main memory
>
> The iSCSI access pattern (1 hour daytime average) looks like the following
> (Thanks to Richard Elling for the dtrace script)

If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
of 4K? This seems like the most obvious improvement.

[ stuff removed ]

> For disaster recovery we plan to sync the pool as often as possible
> to a remote location. Running send/receive after a day or so seems to take
> a significant amount of time wading through all the blocks and we hardly
> see network average traffic going over 45MB/s (almost idle 1G link).
> So here's the question: would increasing/decreasing the volblocksize improve
> the send/receive operation and what influence might show for the iSCSI side?

Matching the volume block size to what the clients are actually using (due to
their filesystem configuration) should improve performance during normal
operations and should reduce the number of blocks which need to be sent in
the backup by reducing write amplification due to "overlap" blocks.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
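P.S. Keep in mind that volblocksize is fixed when a zvol is created and
cannot be changed afterwards, so trying 4K means creating a new zvol and
migrating the data onto it. A minimal sketch (pool name, size and volume
name are only placeholders):

    zfs create -V 200G -o volblocksize=4k tank/xen-vol01-4k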
Jim Klimov
2013-Jan-17 16:35 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-17 16:04, Bob Friesenhahn wrote:

> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
> volblocksize of 4K? This seems like the most obvious improvement.
>
> Matching the volume block size to what the clients are actually using
> (due to their filesystem configuration) should improve performance
> during normal operations and should reduce the number of blocks which
> need to be sent in the backup by reducing write amplification due to
> "overlap" blocks.

Also, while you are at it, it would make sense to verify that the clients
(i.e. the VMs' filesystems) do their I/Os 4KB-aligned, i.e. that their
partitions start at a 512b-based sector offset divisible by 8 inside the
virtual HDDs, and that the FS headers also align to that, so the first
cluster is 4KB-aligned.

The classic MSDOS MBR layout did not guarantee such an aligned partition
start, since it used 63 sectors as the cylinder size and offset factor.
Newer OSes don't use the classic layout, as any configuration is allowable;
and GPT is well aligned as well.

Overall, a single I/O in the VM guest changing a 4KB cluster in its FS
should translate to one 4KB I/O in your backend storage changing the
dataset's userdata (without reading a bigger block and modifying it with
COW), plus some avalanche of metadata updates (likely with COW) for ZFS's
own bookkeeping.

//Jim
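A quick way to check this from inside a Linux guest (assuming the virtual
disk shows up as /dev/xvda; the device name is just an example) is to list
the partition table in sectors and confirm that every start sector is
divisible by 8:

    fdisk -lu /dev/xvda
    # or: parted /dev/xvda unit s print
    # a partition starting at sector 2048 is 4KB-aligned (2048 % 8 == 0);
    # one starting at the classic sector 63 is not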
Richard Elling
2013-Jan-18 01:42 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Wed, 16 Jan 2013, Thomas Nau wrote:
>
>> Dear all
>> I've a question concerning possible performance tuning for both iSCSI access
>> and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
>> default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
>> The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
>> SSDs and 128G of main memory
>>
>> The iSCSI access pattern (1 hour daytime average) looks like the following
>> (Thanks to Richard Elling for the dtrace script)
>
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
> of 4K? This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)

> [ stuff removed ]
>
>> For disaster recovery we plan to sync the pool as often as possible
>> to a remote location. Running send/receive after a day or so seems to take
>> a significant amount of time wading through all the blocks and we hardly
>> see network average traffic going over 45MB/s (almost idle 1G link).
>> So here's the question: would increasing/decreasing the volblocksize improve
>> the send/receive operation and what influence might show for the iSCSI side?
>
> Matching the volume block size to what the clients are actually using (due to
> their filesystem configuration) should improve performance during normal
> operations and should reduce the number of blocks which need to be sent in
> the backup by reducing write amplification due to "overlap" blocks.

compression is a good win, too
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
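Turning compression on is a one-line change and only affects newly written
blocks; a sketch with a placeholder volume name (lzjb is what a plain
compression=on gives you on current releases):

    zfs set compression=on tank/xen-vol01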
Richard Elling
2013-Jan-18 01:50 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 8:35 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2013-01-17 16:04, Bob Friesenhahn wrote:
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
>> volblocksize of 4K? This seems like the most obvious improvement.
>>
>> Matching the volume block size to what the clients are actually using
>> (due to their filesystem configuration) should improve performance
>> during normal operations and should reduce the number of blocks which
>> need to be sent in the backup by reducing write amplification due to
>> "overlap" blocks.
>
> Also, while you are at it, it would make sense to verify that the clients
> (i.e. the VMs' filesystems) do their I/Os 4KB-aligned, i.e. that their
> partitions start at a 512b-based sector offset divisible by 8 inside the
> virtual HDDs, and that the FS headers also align to that, so the first
> cluster is 4KB-aligned.

This is the classical expectation. So I added an alignment check into
nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from ESX
and, believe it or not, alignment doesn't matter at all, at least for the
data I've collected. I'll let NetApp wallow in the mire of misalignment while
I blissfully dream of other things :-)

> The classic MSDOS MBR layout did not guarantee such an aligned partition
> start, since it used 63 sectors as the cylinder size and offset factor.
> Newer OSes don't use the classic layout, as any configuration is allowable;
> and GPT is well aligned as well.
>
> Overall, a single I/O in the VM guest changing a 4KB cluster in its FS
> should translate to one 4KB I/O in your backend storage changing the
> dataset's userdata (without reading a bigger block and modifying it with
> COW), plus some avalanche of metadata updates (likely with COW) for ZFS's
> own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload on the
wire. To wit, I did a bunch of VDI and VDI-like (small, random writes)
testing on XenServer and while the clients were chugging away doing 4K
random I/Os, on the wire I was seeing 1MB NFS writes. In part this analysis
led to my cars-and-trains analysis. In some VMware configurations, over the
wire you could see a 16k read for every 4k random write. Go figure.
Fortunately, those 16k reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of what is
really happening over the wire.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
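For a quick look, running it with just a sampling interval is usually enough
(a sketch; this assumes iscsisvrtop takes an interval in seconds as its
argument, the same way nfssvrtop does):

    ./iscsisvrtop 10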
Thomas Nau
2013-Jan-18 05:35 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
Thanks for all the answers (more inline).

On 01/18/2013 02:42 AM, Richard Elling wrote:

> On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>
>> On Wed, 16 Jan 2013, Thomas Nau wrote:
>>
>>> Dear all
>>> I've a question concerning possible performance tuning for both iSCSI access
>>> and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
>>> default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
>>> The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
>>> SSDs and 128G of main memory
>>>
>>> The iSCSI access pattern (1 hour daytime average) looks like the following
>>> (Thanks to Richard Elling for the dtrace script)
>>
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
>> of 4K? This seems like the most obvious improvement.
>
> 4k might be a little small. 8k will have less metadata overhead. In some cases
> we've seen good performance on these workloads up through 32k. Real pain
> is felt at 128k :-)

My only pain so far is the time a send/receive takes without really loading the
network at all. VM performance is nothing I worry about at all as it's pretty good.
So key question for me is if going from 8k to 16k or even 32k would have some
benefit for that problem?

>> [ stuff removed ]
>>
>>> For disaster recovery we plan to sync the pool as often as possible
>>> to a remote location. Running send/receive after a day or so seems to take
>>> a significant amount of time wading through all the blocks and we hardly
>>> see network average traffic going over 45MB/s (almost idle 1G link).
>>> So here's the question: would increasing/decreasing the volblocksize improve
>>> the send/receive operation and what influence might show for the iSCSI side?
>>
>> Matching the volume block size to what the clients are actually using (due to
>> their filesystem configuration) should improve performance during normal
>> operations and should reduce the number of blocks which need to be sent in
>> the backup by reducing write amplification due to "overlap" blocks.
>
> compression is a good win, too

Thanks for that. I'll use the tools you mentioned to drill down.

> -- richard

Thomas
Jim Klimov
2013-Jan-18 12:40 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-18 06:35, Thomas Nau wrote:

>>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
>>> of 4K? This seems like the most obvious improvement.
>>
>> 4k might be a little small. 8k will have less metadata overhead. In some cases
>> we've seen good performance on these workloads up through 32k. Real pain
>> is felt at 128k :-)
>
> My only pain so far is the time a send/receive takes without really loading the
> network at all. VM performance is nothing I worry about at all as it's pretty good.
> So key question for me is if going from 8k to 16k or even 32k would have some
> benefit for that problem?

I would guess that increasing the block size would on one hand improve
your reads - due to more userdata being stored contiguously as part of
one ZFS block - and thus sending of the backup streams should be more
about reading and sending the data and less about random seeking.

On the other hand, this may likely be paid for by the need to do more
read-modify-writes (when larger ZFS blocks are partially updated with
the smaller clusters in the VM's filesystem) while the overall system
is running and used for its primary purpose. However, since the guest
FS is likely to store files of non-minimal size, it is likely that the
whole larger backend block would be updated anyway...

So, I think, this is something an experiment can show you - whether the
gain during backup (and primary-job) reads vs. possible degradation
during the primary-job writes would be worth it.

As for the experiment, I guess you can always make a ZVOL with a different
volblocksize, dd data into it from a snapshot of the production dataset, and
attach the VM or its clone to the newly created copy of its disk image.

Good luck, and I hope I got Richard's logic right in that answer ;)

//Jim
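A minimal sketch of such an experiment (all pool/volume names and sizes are
placeholders; the clone step is just one convenient way to get a stable,
readable source device):

    zfs snapshot tank/xen-vol01@blocksize-test
    zfs clone tank/xen-vol01@blocksize-test tank/xen-vol01-ro
    zfs create -V 200G -o volblocksize=32k tank/xen-vol01-32k
    dd if=/dev/zvol/rdsk/tank/xen-vol01-ro \
       of=/dev/zvol/rdsk/tank/xen-vol01-32k bs=1M

Then export the new zvol over iSCSI to a clone of the VM and compare both the
guest I/O behaviour and the time an incremental send/receive takes.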
Richard Elling
2013-Jan-19 06:05 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 9:35 PM, Thomas Nau <Thomas.Nau at uni-ulm.de> wrote:

> Thanks for all the answers (more inline).
>
> On 01/18/2013 02:42 AM, Richard Elling wrote:
>> On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>>
>>> On Wed, 16 Jan 2013, Thomas Nau wrote:
>>>
>>>> Dear all
>>>> I've a question concerning possible performance tuning for both iSCSI access
>>>> and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
>>>> default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
>>>> The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
>>>> SSDs and 128G of main memory
>>>>
>>>> The iSCSI access pattern (1 hour daytime average) looks like the following
>>>> (Thanks to Richard Elling for the dtrace script)
>>>
>>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
>>> of 4K? This seems like the most obvious improvement.
>>
>> 4k might be a little small. 8k will have less metadata overhead. In some cases
>> we've seen good performance on these workloads up through 32k. Real pain
>> is felt at 128k :-)
>
> My only pain so far is the time a send/receive takes without really loading the
> network at all. VM performance is nothing I worry about at all as it's pretty good.
> So key question for me is if going from 8k to 16k or even 32k would have some
> benefit for that problem?

send/receive can bottleneck on the receiving side. Take a look at the archives
searching for "mbuffer" as a method of buffering on the receive side. In a well
tuned system, the send will be from ARC :-)
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
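A typical mbuffer pipeline looks something like the following (host names,
dataset names, port and buffer sizes are only examples, not a recommendation):

    # on the receiving host
    mbuffer -s 128k -m 1G -I 9090 | zfs receive -F backup/xen-vol01

    # on the sending host
    zfs send -i tank/xen-vol01@yesterday tank/xen-vol01@today | \
        mbuffer -s 128k -m 1G -O backuphost:9090

Here mbuffer itself carries the stream over TCP; if the stream stays inside
ssh instead, mbuffer can simply sit in front of zfs receive on the remote side.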
Richard Elling
2013-Jan-19 06:10 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 18, 2013, at 4:40 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2013-01-18 06:35, Thomas Nau wrote:
>>>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
>>>> of 4K? This seems like the most obvious improvement.
>>>
>>> 4k might be a little small. 8k will have less metadata overhead. In some cases
>>> we've seen good performance on these workloads up through 32k. Real pain
>>> is felt at 128k :-)
>>
>> My only pain so far is the time a send/receive takes without really loading the
>> network at all. VM performance is nothing I worry about at all as it's pretty good.
>> So key question for me is if going from 8k to 16k or even 32k would have some
>> benefit for that problem?
>
> I would guess that increasing the block size would on one hand improve
> your reads - due to more userdata being stored contiguously as part of
> one ZFS block - and thus sending of the backup streams should be more
> about reading and sending the data and less about random seeking.

There is too much caching in the datapath to make a broad statement stick.
Empirical measurements with your workload will need to choose the winner.

> On the other hand, this may likely be paid for by the need to do more
> read-modify-writes (when larger ZFS blocks are partially updated with
> the smaller clusters in the VM's filesystem) while the overall system
> is running and used for its primary purpose. However, since the guest
> FS is likely to store files of non-minimal size, it is likely that the
> whole larger backend block would be updated anyway...

For many ZFS implementations, RMW for zvols is the norm.

> So, I think, this is something an experiment can show you - whether the
> gain during backup (and primary-job) reads vs. possible degradation
> during the primary-job writes would be worth it.
>
> As for the experiment, I guess you can always make a ZVOL with a different
> volblocksize, dd data into it from a snapshot of the production dataset, and
> attach the VM or its clone to the newly created copy of its disk image.

In my experience, it is very hard to recreate in the lab the environments
found in real life. dd, in particular, will skew the results a bit because it
is in LBA order for zvols, not the creation order as seen in the real world.
That said, trying to get high performance out of HDDs is an exercise like
fighting the tides :-)
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-19 15:16 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
> volblocksize of 4K? This seems like the most obvious improvement.

Oh, I forgot to mention - the above logic only makes sense for mirrors and
stripes. Not for raidz (or raid-5/6/dp in general).

If you have a pool of mirrors or stripes, the system isn't forced to
subdivide a 4k block onto multiple disks, so it works very well. But if you
have a pool blocksize of 4k and let's say a 5-disk raidz (capacity of 4
disks), then the 4k block gets divided into 1k on each disk and 1k parity on
the parity disk. Now, since the hardware only supports block sizes of 4k ...
You can see there's a lot of wasted space, and if you do a bunch of it,
you'll also have a lot of wasted time waiting for seeks/latency.
Richard Elling
2013-Jan-19 22:39 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris)
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
>> volblocksize of 4K? This seems like the most obvious improvement.
>
> Oh, I forgot to mention - the above logic only makes sense for mirrors and
> stripes. Not for raidz (or raid-5/6/dp in general).
>
> If you have a pool of mirrors or stripes, the system isn't forced to
> subdivide a 4k block onto multiple disks, so it works very well. But if you
> have a pool blocksize of 4k and let's say a 5-disk raidz (capacity of 4
> disks), then the 4k block gets divided into 1k on each disk and 1k parity on
> the parity disk. Now, since the hardware only supports block sizes of 4k ...
> You can see there's a lot of wasted space, and if you do a bunch of it,
> you'll also have a lot of wasted time waiting for seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised
of 4k sector disks, then there will be one data and one parity block. There
will not be 4 data + 1 parity with 75% space wastage. Rather, the space
allocation more closely resembles a variant of mirroring, like some vendors
call "RAID-1E"
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
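To put rough numbers on it (a simplified sketch that ignores the occasional
raidz padding sector): on a 5-disk raidz1 built from 4KB-sector drives, a 4k
block is stored as 1 data + 1 parity sector, i.e. 8k allocated for 4k of user
data (about 50% space efficiency, mirror-like), whereas a 32k block is stored
as 8 data + 2 parity sectors, i.e. 40k allocated for 32k of user data (80%,
the nominal 4/5 of the vdev). So a small volblocksize on raidz costs space,
but not via the 4-way split described above.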
Jim Klimov
2013-Jan-19 23:00 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-19 23:39, Richard Elling wrote:

> This is not quite true for raidz. If there is a 4k write to a raidz comprised
> of 4k sector disks, then there will be one data and one parity block. There
> will not be 4 data + 1 parity with 75% space wastage. Rather, the space
> allocation more closely resembles a variant of mirroring, like some vendors
> call "RAID-1E"

I agree with this exact reply, but as I posted sometime late last year,
reporting on my "digging in the bowels of ZFS" and my problematic pool, for
a 6-disk raidz2 set I only saw allocations (including the two parity disks)
divisible by 3 sectors, even if the amount of the (compressed) userdata was
not so rounded. I.e. I had either miniature files or tails of files fitting
into one sector plus two parities (overall a 3-sector allocation), or tails
ranging 2-4 sectors and occupying 6 with parity (while 2 or 3 sectors could
use just 4 or 5 with parities, respectively).

I am not sure what these numbers mean - 3 being a case for "one userdata
sector plus both parities" or for "half of a 6-disk stripe" - both such
explanations fit in my case. But yes, with the current raidz allocation there
are many ways to waste space. And those small percentages (or not so small)
do add up.

Rectifying this example, i.e. allocating only as much as is used, does not
seem like an incompatible on-disk format change, and should be doable within
the write-queue logic. Maybe it would cause tradeoffs in efficiency; however,
ZFS does explicitly "rotate" the starting disks of allocations every few
megabytes in order to even out the load among spindles (normally parity disks
don't have to be accessed - unless mismatches occur on data disks). Disabling
such padding would only help achieve this goal and save space at the same
time...

My 2c,
//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Jan-20 16:21 UTC
[zfs-discuss] iSCSI access patterns and possible improvements?
> From: Richard Elling [mailto:richard.elling at gmail.com]
> Sent: Saturday, January 19, 2013 5:39 PM
>
> the space allocation more closely resembles a variant of mirroring,
> like some vendors call "RAID-1E"

Awesome, thank you. :-)