Nathan Kroenert
2011-Feb-13 08:56 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi all,

Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.

Some detail:
I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).

Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery backed, presenting 2 spindles as single-member stripes, so, yeah, the nearest thing to JBOD that this controller gets to):

    pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
     Hewlett-Packard Company Smart Array Controller

And it's off this HP controller I'm hanging my data zpool.

config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0

CPU is an AMD Phenom II, 6-core 1075T, for what it's worth.

I guess my problem is more one that the ZFS folks should be aware of rather than something directly impacting me, as the workload I have created is not something I typically see - but it is something I can see easily impacting customers - and in a nasty way should they encounter it. It *is* also a case I'll create from time to time - when I'm moving DVD images backwards and forwards...

I was stress testing the box, giving the new kit's legs a stretch, and kicked off the following:
 - create a test file to use as source for my 'full speed streaming write' (lazy way):
     dd if=/dev/urandom > /tmp/1
   (and let that run for a few seconds, creating about 100MB of random junk)
 - start some jobs:
     while :; do cat /tmp/1 >> /data/delete.me/2; done &
   (the write workload, which is fine and dandy by itself)
     while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &

Before I kicked off the read workload, everything looked as expected. I was getting between 40 and 60MB/s to each of the disks and all was good. BUT - as soon as I introduced the read workload, my write throughput dropped to virtually zero, and remained there until the read workload was killed.

The starvation is immediate. I can 100% reproducibly go from many MB/s of write throughput with no read workload to virtually 0MB/s write throughput, simply by kicking off that reading dd. Write performance picks up again as soon as I kill the read workload. It also behaves the same way if the file I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 while the dd reads file 2).

Other things to know about the system:
 - Disks are Seagate 2TB, 512-byte-sector SATA disks
 - OS is Solaris 11 Express (build 151a)
 - zpool version is old. I'm still hedging my bets on having to go back to Nevada (SXCE, build 124 or so, which is what I was at before installing S11 Express):
     Cached configuration:
             version: 19
 - Plenty of space remains in the pool:
     bash-4.0$ zpool list
     NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
     data  1.81T  1.34T  480G  74%  1.00x  ONLINE  -
 - The box has 8GB of memory - and ZFS is getting a fair whack at it.
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     211843               827   11%
ZFS File Data             1426054              5570   73%
Anon                       106814               417    5%
Exec and libs                9364                36    0%
Page cache                  47192               184    2%
Free (cachelist)            31448               122    2%
Free (freelist)            130431               509    7%

Total                     1963146              7668
Physical                  1963145              7668

 - Rest of the zfs dataset properties:

# zfs get all data
NAME  PROPERTY               VALUE                  SOURCE
data  type                   filesystem             -
data  creation               Mon May 24 10:46 2010  -
data  used                   1.34T                  -
data  available              451G                   -
data  referenced             500G                   -
data  compressratio          1.02x                  -
data  mounted                yes                    -
data  quota                  none                   default
data  reservation            none                   default
data  recordsize             128K                   default
data  mountpoint             /data                  default
data  sharenfs               ro,anon=0              local
data  checksum               on                     default
data  compression            off                    local
data  atime                  off                    local
data  devices                on                     default
data  exec                   on                     default
data  setuid                 on                     default
data  readonly               off                    default
data  zoned                  off                    default
data  snapdir                hidden                 default
data  aclinherit             restricted             default
data  canmount               on                     default
data  xattr                  on                     default
data  copies                 1                      default
data  version                3                      -
data  utf8only               off                    -
data  normalization          none                   -
data  casesensitivity        sensitive              -
data  vscan                  off                    default
data  nbmand                 off                    default
data  sharesmb               off                    default
data  refquota               none                   default
data  refreservation         none                   local
data  primarycache           all                    default
data  secondarycache         all                    default
data  usedbysnapshots        12.2G                  -
data  usedbydataset          500G                   -
data  usedbychildren         864G                   -
data  usedbyrefreservation   0                      -
data  logbias                latency                default
data  dedup                  off                    default
data  mlslabel               none                   default
data  sync                   standard               default
data  encryption             off                    -
data  keysource              none                   default
data  keystatus              none                   -
data  rekeydate              -                      default
data  rstchown               on                     default
data  com.sun:auto-snapshot  true                   local

Obviously, the potential for performance issues is considerable - and should it be required, I can provide some other detail, but given that this is so easy to reproduce, I thought I'd get it out there, just in case.

It is also worthy of note that commands like 'zfs list' take anywhere from 20 to 40 seconds to run when I have that sort of workload running - which also seems less than optimal.

I tried to recreate this issue on the boot pool (rpool), which is a single 2.5" 7200rpm disk (to take the cache controller out of the configuration) - but this seemed to hard-hang the system (yep - even caps lock / num lock were non-responsive) - and I did not have any watchdog/snooping set up, and ran out of steam myself, so just hit the big button.

When I get the chance, I'll give the rpool thing a crack again, but overall, it seems to me that the behaviour I'm observing is not great...

I'm also happy to supply lockstat / dtrace output etc. if it'll help.

Thoughts?

Cheers!

Nathan.
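For anyone who wants to reproduce this, the test boils down to something like the sketch below - a rough consolidation of the steps described above, with the paths, sizes and dataset names being just the examples from this setup:

    #!/bin/sh
    # Sketch: reproduce the read-starves-write behaviour described above.
    # Assumes a ZFS dataset mounted at /data with a scratch directory.
    mkdir -p /data/delete.me

    # ~100MB of incompressible junk as the streaming-write source.
    dd if=/dev/urandom of=/tmp/1 bs=1024k count=100

    # Write workload: append the junk file forever.
    while :; do cat /tmp/1 >> /data/delete.me/2; done &
    WRITER=$!

    # Let the writer run alone for a bit; writes flow at full speed.
    sleep 30

    # Read workload: stream the (now large) file back.
    while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
    READER=$!

    # Clean up with: kill $WRITER $READER

Watching 'iostat -x 1' in another terminal while this runs should show kw/s collapsing within a second or two of the reader starting, per the report above.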
Richard Elling
2011-Feb-13 17:31 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nathan at tuneunix.com> wrote:

> Hi all,
>
> Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.
>
> Some detail:
> I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).
>
> Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery backed, presenting 2 spindles as single-member stripes, so, yeah, the nearest thing to JBOD that this controller gets to)

What is the average service time of each disk? Multiply that by the average active queue depth. If that number is greater than, say, 100ms, then the ZFS I/O scheduler is not able to be very effective because the disks are too slow. Reducing the active queue depth can help; see zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks help, too.

NexentaStor fans, note that you can do this easily, on the fly, via the Settings -> Preferences -> System web GUI.
 -- richard
> <snip - remainder of original message quoted in full>
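For reference, the knob Richard points at can be read and changed on a live Solaris system roughly like this (the mdb incantations are the ones documented in the ZFS Evil Tuning Guide; the value 2 below is just an example, and the live change does not survive a reboot):

    # Check the current value (decimal)
    echo zfs_vdev_max_pending/D | mdb -k

    # Change it on the fly (here to 2); takes effect immediately
    echo zfs_vdev_max_pending/W0t2 | mdb -kw

    # To make it persistent across reboots, add to /etc/system:
    #   set zfs:zfs_vdev_max_pending = 2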
Nathan Kroenert
2011-Feb-14 04:04 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi Steve,

Thanks for the thoughts - I think that everything you asked about is in the original email - but for reference again, it's 151a (S11 Express).

Are you really suggesting, for a single user system, I need 16GB of memory just to get ZFS to be able to write when it's reading? (And even then, that would be contingent on getting repeat, cached hits in the ARC.) That's hardly sensible, and anything but enterprise. I know I'm only talking about my little baby box at the moment, but extend that to a large database application, and I'm seeing badness all round.

Worse - if I'm reading a 45GB contiguous file (say, HD video), the only way the ARC will help me is if I have 64GB, and have read it in the past... especially if I'm reading it sequentially. That's inconceivable!! (cue reference to The Princess Bride :)

I'd also add that for the most part, 8GB is plenty for ZFS, and there are a lot of Sun/Oracle customers using it now in LDOM environments where 8GB is just great in the control/IO domain.

I don't think trying to blame the system in this case is the right answer. ZFS schedules the read/write activities, and to me it seems that it's just not doing that fairly.

I was suspicious of the impact the HP RAID controller was having - and how it might be reacting to what's being pushed at it - so re-created exactly this problem again on a different system with native, non-cached SATA controllers. The issue is identical. (Though I have since determined that my HP RAID controller is actually *slowing* my reads and writes to disk! ;)

Cheers!

Nathan.

On 14/02/2011 4:08 AM, gonczi at comcast.net wrote:
> Hi Nathan,
>
> Maybe it is buried somewhere in your email, but I did not see what
> zfs version you are using.
>
> This is rather important, because the 145+ kernels work a lot better
> in many ways than the early ones (say 134-ish).
>
> So whenever you are reporting various ZFS issues, something like
> `uname -a` to report the kernel rev is most useful.
>
> Writes starved by reads has been a complaint in early ZFS; I certainly
> do not see any evidence of this in the 145+ kernels.
>
> There is a fair amount of tuning and configuration that can be done
> (adding SSDs to your pool, zil vs no zil, how caching is configured,
> ie what to cache..)
> 8 Gig is not a lot of memory for ZFS; I would recommend double that.
>
> If all goes well, most reads would be satisfied from ARC, and not
> interfere with writes.
>
> Steve
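As an aside on the ARC-sizing point: on Solaris-derived systems you can sanity-check how big the ARC actually is via its kstats. A minimal sketch, assuming the usual arcstats kstat names:

    # Current ARC size, target size, and ceiling, in bytes
    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:c
    kstat -p zfs:0:arcstats:c_max

In this thread's case, the ::memstat output in the first message already shows the ARC holding about 5.5GB of the 8GB, so the ARC is not starved for memory.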
Nathan Kroenert
2011-Feb-14 04:28 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
On 14/02/2011 4:31 AM, Richard Elling wrote:
> On Feb 13, 2011, at 12:56 AM, Nathan Kroenert<nathan at tuneunix.com> wrote:
>
>> Hi all,
>>
>> Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.
>>
>> <snip>
> What is the average service time of each disk? Multiply that by the average
> active queue depth. If that number is greater than, say, 100ms, then the ZFS
> I/O scheduler is not able to be very effective because the disks are too slow.
> Reducing the active queue depth can help; see zfs_vdev_max_pending in the
> ZFS Evil Tuning Guide. Faster disks help, too.
>
> NexentaStor fans, note that you can do this easily, on the fly, via the Settings ->
> Preferences -> System web GUI.
>  -- richard

Hi Richard,

Long time no speak! Anyhoo - see below.

I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something.

Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1-second samples with iostat - while I have included only about 10 seconds, they are representative of what I'm seeing all the time.

                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
    sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
    sd7        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
    sd7        327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
    sd7        409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
    sd7        413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
    sd7        384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
    sd7        424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
    sd7        189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
    sd7        307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
    sd7        408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100

Bottom line is that ZFS does not seem to care about getting my writes to disk when there is a heavy read workload.

I have also confirmed that it's not the RAID controller either - behaviour is identical with direct-attach SATA.

But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes things to swing dramatically!
 - At 1, writes proceed much more than reads - 20MB/s read per spindle : 35MB/s write per spindle
 - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s write
 - At 3, it's starting to lean more heavily to reads again, but writes at least get a whack - 35MB/s per spindle read : 15-20MB/s write
 - At 4, we are closer to 35-40MB/s read, 15MB/s write

By the time we get back to the default of 0xa (10), writes drop off almost completely.

The crossover (on the box with no RAID controller) seems to be 5. Anything more than that, and writes get shouldered out of the way almost completely. (One way to script such a sweep is sketched after this message.)

So - aside from the obvious - manually setting zfs_vdev_max_pending - do you have any thoughts on ZFS being able to make this sort of determination by itself? It would be somewhat of a shame to have to bust out such 'whacky knobs' for plain old direct-attach SATA disks just to get balance...

Also - can I set this property per-vdev? (Just in case I have SATA and, say, a USP-V connected to the same box.)

Thanks again, and good to see you are still playing close by!

Cheers!

Nathan.

>> <snip - original message quoted in full>
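The sweep described in the list above can be scripted. A rough sketch (not the actual commands used in this thread) for a live Solaris box, assuming root privileges and the sd6/sd7 device names from the iostat output:

    #!/bin/sh
    # Sweep zfs_vdev_max_pending and watch the read/write split at each depth.
    # WARNING: pokes live kernel state via mdb - test boxes only.
    for depth in 1 2 3 4 5 10; do
        echo "zfs_vdev_max_pending/W0t${depth}" | mdb -kw
        echo "=== max_pending=${depth} ==="
        # 10 one-second samples per setting; drop the header lines
        iostat -x sd6 sd7 1 10 | grep -v device
    done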
Richard Elling
2011-Feb-14 19:44 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi Nathan,
comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:

> On 14/02/2011 4:31 AM, Richard Elling wrote:
>> <snip>
>
> Hi Richard,
>
> Long time no speak! Anyhoo - see below.
>
> I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something.

Faster disks always help :-)

> Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1-second samples with iostat - while I have included only about 10 seconds, they are representative of what I'm seeing all the time.
>
>                      extended device statistics
>     device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
>     sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
>     sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

ok, we'll take sd6 as an example (the math is easy :-) ...

    actv  = 10
    svc_t = 26.7

    actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk, but once an I/O leaves ZFS, there is no way for ZFS to manage it. In the case of the active queue, the I/Os have left the OS, so even the OS is unable to change what is in the queue or directly influence when the I/Os will be finished.

In ZFS, the queue has a priority scheduler and does place a higher priority on async writes than async reads (since b130 or so). But what you can see is that the intermittent nature of the async writes gets them stuck behind the 267 milliseconds as the queue drains the reads. [no, I'm not sure if that makes sense, try again...] If it sends reads continuously and writes occasionally, it will appear that reads dominate much more.
In older releases, when the reads and writes had the same priority, this looked even worse.

> <snip - further iostat samples>
>
> Bottom line is that ZFS does not seem to care about getting my writes to disk when there is a heavy read workload.
>
> I have also confirmed that it's not the RAID controller either - behaviour is identical with direct-attach SATA.
>
> But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes things to swing dramatically!
> - At 1, writes proceed much more than reads - 20MB/s read per spindle : 35MB/s write per spindle
> - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s write

Though the NexentaStor docs recommend "1" for SATA disks, I find that "2" works better.

> - At 3, it's starting to lean more heavily to reads again, but writes at least get a whack - 35MB/s per spindle read : 15-20MB/s write
> - At 4, we are closer to 35-40MB/s read, 15MB/s write

Isn't queuing theory fun! :-)

> By the time we get back to the default of 0xa (10), writes drop off almost completely.
>
> The crossover (on the box with no RAID controller) seems to be 5. Anything more than that, and writes get shouldered out of the way almost completely.
>
> So - aside from the obvious - manually setting zfs_vdev_max_pending - do you have any thoughts on ZFS being able to make this sort of determination by itself? It would be somewhat of a shame to have to bust out such 'whacky knobs' for plain old direct-attach SATA disks just to get balance...
>
> Also - can I set this property per-vdev? (Just in case I have SATA and, say, a USP-V connected to the same box.)

Today, there is not a per-vdev setting. There are several changes in the works for this and other scheduling.

Incidentally, you can change the priorities on the fly, so you could experiment with different settings for mixed workloads. Obviously, non-mixed workloads won't be very interesting.
Also FWIW, for SATA disks in particular, it is not unusual for us to recommend dropping zfs_vdev_max_pending to 2. It can make a big difference for some workloads.
 -- richard
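Richard's rule of thumb from earlier in the thread (average active queue depth x average service time greater than ~100ms means the ZFS scheduler has little room to work) is easy to check mechanically. A minimal sketch, assuming the classic 'iostat -x' column layout quoted above:

    # Flag devices whose in-flight time (actv * svc_t) exceeds ~100ms.
    # First iostat report is the since-boot average; the second is live.
    iostat -x 5 2 | awk '
        $1 ~ /^sd|^c[0-9]/ && $7 * $8 > 100 {
            printf "%s: actv=%s svc_t=%s => %.0f ms in flight\n", $1, $7, $8, $7 * $8
        }'

For the sd6 sample above this prints roughly 267ms, well past the threshold, which is why reducing zfs_vdev_max_pending restores balance.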
Nathan Kroenert
2011-Feb-15 00:27 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Thanks for all the thoughts, Richard.

One thing that still sticks in my craw is that I'm not wanting to write intermittently. I'm wanting to write flat out, and those writes are being held up... It seems to me that ZFS should know about that, and do something about it, without me needing to tune zfs_vdev_max_pending...

Nonetheless, I'm now at a far more balanced point than when I started, so that's a good thing. :)

Cheers,

Nathan.

On 15/02/2011 6:44 AM, Richard Elling wrote:
> Hi Nathan,
> comments below...
>
> <snip>