I've heard a claim that ZFS relies too much on RAM caching, but implements
no sort of priorities (indeed, I've seen no knobs to tune those) - so that
if the storage box receives many different types of IO requests with
different "administrative weights" in the view of admins, it cannot really
throttle some IOs to boost others when such IOs have to hit the pool's
spindles.

For example, I might want to have corporate webshop-related databases and
appservers be the fastest storage citizens, then some corporate CRM and
email, then various lower-priority zones and VMs, and at the bottom of the
list - backups.

AFAIK, now such requests would hit the ARC, then the disks if needed - in
no particular order. Well, can the order be made "particular" with the
current ZFS architecture, i.e. by setting some datasets to have a certain
NICEness or another priority mechanism?

Thanks for info/ideas,
//Jim
On Thu, 29 Nov 2012, Jim Klimov wrote:

> I've heard a claim that ZFS relies too much on RAM caching, but
> implements no sort of priorities (indeed, I've seen no knobs to
> tune those) - so that if the storage box receives many different
> types of IO requests with different "administrative weights" in
> the view of admins, it cannot really throttle some IOs to boost
> others when such IOs have to hit the pool's spindles.
>
> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.
>
> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

QoS poses a problem. Zfs needs to write a transaction group at a time.
During part of the TXG write cycle, zfs does not return any data. Zfs
writes TXGs quite hard so they fill the I/O channel. Even if one orders
the reads during the TXG write cycle, zfs will not return any data for
part of the time.

There are really only a few solutions when resources might be limited:

  1. Use fewer resources
  2. Use resources more wisely
  3. Add more resources until the problem goes away

I think that current zfs strives for #1 and QoS is option #2. Quite often,
option #3 is effective because problems just go away once enough resources
are available.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
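For reference, the TXG sync behaviour Bob describes is governed on
illumos-derived kernels by the zfs_txg_timeout tunable; a minimal sketch of
adjusting it, assuming an illumos-style /etc/system and mdb (the value of
5 seconds is only an example, not a recommendation):

  # /etc/system -- shorten the interval between TXG syncs so each
  # write burst is smaller (value in seconds; needs a reboot)
  set zfs:zfs_txg_timeout = 5

  # or poke a live kernel with mdb (takes effect immediately)
  echo "zfs_txg_timeout/W 0t5" | mdb -kw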
On 11/29/12 10:56 AM, Jim Klimov wrote:

> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.
>
> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

Something like that is implemented in Joyent's Illumos-based distribution,
SmartOS. (Illumos is the open-source continuation of the OpenSolaris
OS/Net, just as Solaris 11 is the closed one.) Following their work it is
also implemented in OpenIndiana/Illumos, and possibly others. A list of
Illumos-based distributions:
http://wiki.illumos.org/display/illumos/Distributions

It uses Solaris Zones and throttles their disk usage at that level, so you
separate workload processes into separate zones. Or you can even put KVM
machines under the zones (Joyent and OI support the Joyent-written
KVM/Intel implementation in Illumos) for the same reason of I/O throttling.

They (Joyent) say that their solution is not much code, but gives very
good results (they run a massive cloud computing service with many zones
and KVM VMs, so they should know).
http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/

I don't know whether it is available/applicable to the (now) closed OS/Net
of Solaris 11 and Solaris 10, because Joyent/Illumos have access to the
complete stack and are actively changing it to suit their needs - a good
example of the benefits of an open source/free software stack. But maybe
it is.
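The throttle described above is driven by a per-zone priority setting; a
rough sketch of how it is typically adjusted, assuming a SmartOS build with
vmadm and the zone.zfs-io-priority resource control from the Joyent
throttle (the UUID and zone name are placeholders, and exact property names
may vary between builds):

  # SmartOS: give a zone/VM a larger share of pool I/O (default 100)
  vmadm update <uuid> zfs_io_priority=200

  # plain illumos zones: the same knob is exposed as a zone rctl,
  # adjustable on a running zone with prctl
  prctl -n zone.zfs-io-priority -r -v 200 -i zone webshop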
On 12/ 2/12 03:24 AM, Nikola M. wrote:

> It uses Solaris Zones and throttles their disk usage at that level, so
> you separate workload processes into separate zones. Or you can even put
> KVM machines under the zones (Joyent and OI support the Joyent-written
> KVM/Intel implementation in Illumos) for the same reason of I/O
> throttling.
>
> They (Joyent) say that their solution is not much code, but gives very
> good results (they run a massive cloud computing service with many zones
> and KVM VMs, so they should know).
> http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
> http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/

There is a short video, from the 16th minute onward, from the BayLISA
meetup at Joyent, August 16, 2012:
https://www.youtube.com/watch?v=6csFi0D5eGY
It talks about the ZFS throttle implementation architecture in Illumos,
from Joyent's SmartOS.

I learned it is also available in the Entic.net-sponsored OpenIndiana and
probably in Nexenta too, since it is implemented inside Illumos.

N.
On Dec 1, 2012, at 6:54 PM, "Nikola M." <minikola at gmail.com> wrote:

> On 12/ 2/12 03:24 AM, Nikola M. wrote:
>> It uses Solaris Zones and throttles their disk usage at that level, so
>> you separate workload processes into separate zones. Or you can even
>> put KVM machines under the zones (Joyent and OI support the
>> Joyent-written KVM/Intel implementation in Illumos) for the same reason
>> of I/O throttling.
>>
>> They (Joyent) say that their solution is not much code, but gives very
>> good results (they run a massive cloud computing service with many
>> zones and KVM VMs, so they should know).
>> http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
>> http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/
>
> There is a short video, from the 16th minute onward, from the BayLISA
> meetup at Joyent, August 16, 2012:
> https://www.youtube.com/watch?v=6csFi0D5eGY
> It talks about the ZFS throttle implementation architecture in Illumos,
> from Joyent's SmartOS.

There was a good presentation on this at the OpenStorage Summit in 2011.
Look for it on youtube.

> I learned it is also available in the Entic.net-sponsored OpenIndiana
> and probably in Nexenta too, since it is implemented inside Illumos.

NexentaStor 3.x is not an illumos-based distribution; it is based on
OpenSolaris b134.
 -- richard
On 12/ 2/12 05:19 AM, Richard Elling wrote:

> On Dec 1, 2012, at 6:54 PM, "Nikola M." <minikola at gmail.com> wrote:
>> There is a short video, from the 16th minute onward, from the BayLISA
>> meetup at Joyent, August 16, 2012:
>> https://www.youtube.com/watch?v=6csFi0D5eGY
>> It talks about the ZFS throttle implementation architecture in Illumos,
>> from Joyent's SmartOS.
>
> There was a good presentation on this at the OpenStorage Summit in 2011.
> Look for it on youtube.
>
>> I learned it is also available in the Entic.net-sponsored OpenIndiana
>> and probably in Nexenta too, since it is implemented inside Illumos.
>
> NexentaStor 3.x is not an illumos-based distribution; it is based on
> OpenSolaris b134.

Oh yes, but I had Nexenta in general in mind, where the NexentaStor
community edition is based on Illumos. GDAmore (the Illumos founder) is
from Nexenta, after all. It is good that one can get support/storage from
Nexenta, and it is a living thing - developing, with a future, etc.

And looking at the OpenStorage Summit, I forgot to mention Delphix, which
also has developers formerly from Sun and sells software appliances.

The last info I got about Illumos is that these kinds of enhancements do
not automatically go upstream into Illumos; it is up to the distributions
to choose what to include.

And yes, there are summits:
http://www.nexenta.com/corp/nexenta-tv/openstorage-summit
http://www.openstoragesummit.org/emea/index.html
On Nov 29, 2012, at 1:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> I've heard a claim that ZFS relies too much on RAM caching, but
> implements no sort of priorities (indeed, I've seen no knobs to
> tune those) - so that if the storage box receives many different
> types of IO requests with different "administrative weights" in
> the view of admins, it cannot really throttle some IOs to boost
> others when such IOs have to hit the pool's spindles.

Caching has nothing to do with QoS in this context. *All* modern
filesystems cache to RAM, otherwise they are unusable.

> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.

Please read the papers on the ARC and how it deals with MFU and MRU cache
types. You can adjust these policies using the primarycache and
secondarycache properties at the dataset level.

> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

ZFS has a priority-based I/O scheduler that works at the DMU level.
However, there is no system call interface in UNIX that transfers priority
or QoS information (e.g. read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
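The primarycache and secondarycache policies mentioned here are ordinary
per-dataset properties; a small example of pushing a low-priority dataset
out of the caches (the pool/dataset names are made up):

  # keep only metadata in the ARC for a bulk/backup dataset,
  # and keep its blocks out of the L2ARC entirely
  zfs set primarycache=metadata tank/backups
  zfs set secondarycache=none tank/backups

  # verify the settings
  zfs get primarycache,secondarycache tank/backups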
On 2012-12-05 04:11, Richard Elling wrote:
> On Nov 29, 2012, at 1:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> I've heard a claim that ZFS relies too much on RAM caching, but
>> implements no sort of priorities (indeed, I've seen no knobs to
>> tune those) - so that if the storage box receives many different
>> types of IO requests with different "administrative weights" in
>> the view of admins, it cannot really throttle some IOs to boost
>> others when such IOs have to hit the pool's spindles.
>
> Caching has nothing to do with QoS in this context. *All* modern
> filesystems cache to RAM, otherwise they are unusable.

Yes, I get that. However, many systems get away with less RAM than
recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as the
starting option), and make their compromises elsewhere. They have to
anyway, and they get different results, perhaps even better suited to
certain narrow or big niches.

Whatever the aggregate result, this difference does lead to some differing
features that The Others' marketing trumpets praise as the advantage :) -
like the ability to mark some IO traffic as of higher priority than other
traffic, in one case (which is now also an Oracle product line,
apparently)...

Actually, this question stems from a discussion at a seminar I've recently
attended - which praised ZFS but pointed out its weaknesses against some
other players on the market, so we are not unaware of those.

>> For example, I might want to have corporate webshop-related
>> databases and appservers be the fastest storage citizens, then
>> some corporate CRM and email, then various lower-priority zones
>> and VMs, and at the bottom of the list - backups.
>
> Please read the papers on the ARC and how it deals with MFU and
> MRU cache types. You can adjust these policies using the primarycache
> and secondarycache properties at the dataset level.

I've read about that, and don't exactly see how much these help if there
is pressure on RAM so that cache entries expire... Meaning, if I want
certain datasets to remain cached as long as possible (i.e. serve a
website or DB from RAM, not HDD), at the expense of other datasets that
might see higher usage but have lower business priority - how do I do
that? Or, perhaps, add (L2)ARC shares, reservations and/or quotas concepts
to the certain datasets which I explicitly want to throttle up or down?

At most, now I can mark the lower-priority datasets' data or even metadata
as not cached in ARC or L2ARC. On-off. There seem to be no smaller steps,
like QoS tags [0-7] or something like that.

BTW, as a short side question: is it a true or false statement that if I
set primarycache=metadata, then the ZFS ARC won't cache any "userdata" and
thus it won't appear in (expire into) L2ARC? So the real setting is that I
can cache data+meta in RAM, and only meta in SSD? Not the other way around
(meta in RAM but both data+meta in SSD)?

>> AFAIK, now such requests would hit the ARC, then the disks if
>> needed - in no particular order. Well, can the order be made
>> "particular" with the current ZFS architecture, i.e. by setting
>> some datasets to have a certain NICEness or another priority
>> mechanism?
>
> ZFS has a priority-based I/O scheduler that works at the DMU level.
> However, there is no system call interface in UNIX that transfers
> priority or QoS information (e.g. read() or write()) into the file
> system VFS interface. So the granularity of priority control is by zone
> or dataset.

I do not think I've seen mention of priority controls per dataset, at
least not in generic ZFS. Actually, that was part of my question above.
And while throttling or resource shares between higher-level software
components (zones, VMs) might have a similar effect, this is not something
really controlled and enforced by the storage layer.

> -- richard

Thanks,
//Jim
On 2012-11-29 10:56, Jim Klimov wrote:
> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.

On a side note, I'm now revisiting old ZFS presentations collected over
the years, and one suggested as "TBD" statements the idea that metaslabs
with varying speeds could be used for specific tasks, and not only to
receive the allocations first so that a new pool would perform quickly.
I.e. "TBD: Workload specific freespace selection policies".

Say, I create a new storage box and lay out some bulk file, backup and
database datasets. Even as they are receiving their first bytes, I have
some idea about the kind of performance I'd expect from them - with QoS
per dataset I might destine the databases to the fast LBAs (and smaller
seeks between tracks I expect to use frequently), and the bulk data onto
slower tracks right from the start, and the rest of unspecified data would
grow around the middle of the allocation range.

These types of data would then only "creep" onto the less fitting
metaslabs (faster for bulk, slower for DB) if the target ones run out of
free space. Then the next-best-fitting would be used...

This one idea is somewhat reminiscent of hierarchical storage management,
except that it is about static allocation at write time and takes place
within a single disk (or set of similar disks), in order to warrant
different performance for different tasks.

///Jim
I don't have anything significant to add to this conversation, but wanted
to chime in that I also find the concept of a QoS-like capability very
appealing and that Jim's recent emails resonate with me. You're not alone!
I believe there are many use cases where a granular prioritization that
controls how the ARC, L2ARC, ZIL and underlying vdevs are used to give
priority IO to a specific zvol, share, etc. would be useful. My experience
is stronger on the networking side and I envision a weighted class-based
queuing methodology (or something along those lines). I recognize that
ZFS's architectural preference for coalescing writes and reads into larger
sequential batches might conflict with a QoS-like capability... Perhaps
the ARC/L2ARC tuning might be a good starting point towards that end?

On a related note (maybe?) I would love to see pool-wide settings that
control how aggressively data is added/removed from the ARC, L2ARC, etc.
Something that would accelerate the warming of a cold pool of storage or
be more aggressive in adding/removing cached data on a volatile dataset
(e.g. where virtual machines are turned on/off frequently). I have heard
that some of these defaults might be changed in some future release of
Illumos, but haven't seen any specifics saying that the idea is nearing
fruition in release XYZ.

Matt

On Wed, Dec 5, 2012 at 10:26 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2012-11-29 10:56, Jim Klimov wrote:
>> For example, I might want to have corporate webshop-related
>> databases and appservers be the fastest storage citizens, then
>> some corporate CRM and email, then various lower-priority zones
>> and VMs, and at the bottom of the list - backups.
>
> On a side note, I'm now revisiting old ZFS presentations collected
> over the years, and one suggested as "TBD" statements the idea
> that metaslabs with varying speeds could be used for specific
> tasks, and not only to receive the allocations first so that a new
> pool would perform quickly. I.e. "TBD: Workload specific freespace
> selection policies".
>
> Say, I create a new storage box and lay out some bulk file, backup
> and database datasets. Even as they are receiving their first bytes,
> I have some idea about the kind of performance I'd expect from them -
> with QoS per dataset I might destine the databases to the fast LBAs
> (and smaller seeks between tracks I expect to use frequently), and
> the bulk data onto slower tracks right from the start, and the rest
> of unspecified data would grow around the middle of the allocation
> range.
>
> These types of data would then only "creep" onto the less fitting
> metaslabs (faster for bulk, slower for DB) if the target ones run
> out of free space. Then the next-best-fitting would be used...
>
> This one idea is somewhat reminiscent of hierarchical storage
> management, except that it is about static allocation at write time
> and takes place within a single disk (or set of similar disks), in
> order to warrant different performance for different tasks.
>
> ///Jim
On Dec 5, 2012, at 5:41 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2012-12-05 04:11, Richard Elling wrote:
>> Caching has nothing to do with QoS in this context. *All* modern
>> filesystems cache to RAM, otherwise they are unusable.
>
> Yes, I get that. However, many systems get away with less RAM than
> recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as
> the starting option), and make their compromises elsewhere. They have
> to anyway, and they get different results, perhaps even better suited
> to certain narrow or big niches.

This is nothing more than a specious argument. They have small caches, so
their performance is not as good as those with larger caches. This is like
saying you need a smaller CPU cache because larger CPU caches get full.

> Whatever the aggregate result, this difference does lead to some
> differing features that The Others' marketing trumpets praise as the
> advantage :) - like the ability to mark some IO traffic as of higher
> priority than other traffic, in one case (which is now also an Oracle
> product line, apparently)...
>
> Actually, this question stems from a discussion at a seminar I've
> recently attended - which praised ZFS but pointed out its weaknesses
> against some other players on the market, so we are not unaware of
> those.
>
>> Please read the papers on the ARC and how it deals with MFU and
>> MRU cache types. You can adjust these policies using the primarycache
>> and secondarycache properties at the dataset level.
>
> I've read about that, and don't exactly see how much these help if
> there is pressure on RAM so that cache entries expire... Meaning, if I
> want certain datasets to remain cached as long as possible (i.e. serve
> a website or DB from RAM, not HDD), at the expense of other datasets
> that might see higher usage but have lower business priority - how do
> I do that? Or, perhaps, add (L2)ARC shares, reservations and/or quotas
> concepts to the certain datasets which I explicitly want to throttle
> up or down?

MRU evictions take precedence over MFU evictions. If the data is not in
MFU, then it is, by definition, not being frequently used.

> At most, now I can mark the lower-priority datasets' data or even
> metadata as not cached in ARC or L2ARC. On-off. There seem to be no
> smaller steps, like QoS tags [0-7] or something like that.
>
> BTW, as a short side question: is it a true or false statement that if
> I set primarycache=metadata, then the ZFS ARC won't cache any
> "userdata" and thus it won't appear in (expire into) L2ARC? So the
> real setting is that I can cache data+meta in RAM, and only meta in
> SSD? Not the other way around (meta in RAM but both data+meta in SSD)?

That is correct, by my reading of the code.

>> ZFS has a priority-based I/O scheduler that works at the DMU level.
>> However, there is no system call interface in UNIX that transfers
>> priority or QoS information (e.g. read() or write()) into the file
>> system VFS interface. So the granularity of priority control is by
>> zone or dataset.
>
> I do not think I've seen mention of priority controls per dataset, at
> least not in generic ZFS. Actually, that was part of my question above.
> And while throttling or resource shares between higher-level software
> components (zones, VMs) might have a similar effect, this is not
> something really controlled and enforced by the storage layer.

The priority scheduler is by type of I/O request. For example, sync
requests have priority over async requests. Reads and writes have priority
over scrubbing, etc. The inter-dataset scheduling is done at the zone
level. There is more work being done in this area, but it is still in the
research phase.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
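To make the combination just confirmed concrete, and to get a rough view
of MRU vs. MFU activity over time, something along these lines (the
dataset name is made up):

  # data+metadata cached in RAM, but only metadata spilling into L2ARC
  zfs set primarycache=all tank/db
  zfs set secondarycache=metadata tank/db

  # ARC kstats show how MRU and MFU hits develop
  kstat -p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits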
On Dec 5, 2012, at 7:46 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:

> I don't have anything significant to add to this conversation, but
> wanted to chime in that I also find the concept of a QoS-like capability
> very appealing and that Jim's recent emails resonate with me. You're not
> alone! I believe there are many use cases where a granular
> prioritization that controls how the ARC, L2ARC, ZIL and underlying
> vdevs are used to give priority IO to a specific zvol, share, etc. would
> be useful. My experience is stronger on the networking side and I
> envision a weighted class-based queuing methodology (or something along
> those lines). I recognize that ZFS's architectural preference for
> coalescing writes and reads into larger sequential batches might
> conflict with a QoS-like capability... Perhaps the ARC/L2ARC tuning
> might be a good starting point towards that end?

At present, I do not see async write QoS as being interesting. That leaves
sync writes and reads as the managed I/O. Unfortunately, with HDDs, the
variance in response time >> queue management time, so the results are
less useful than in the case of SSDs. Control theory works, once again.
For sync writes, they are often latency-sensitive and thus have the
highest priority. Reads have lower priority, with prefetch reads at lower
priority still.

> On a related note (maybe?) I would love to see pool-wide settings that
> control how aggressively data is added/removed from the ARC, L2ARC, etc.

Evictions are done on an as-needed basis. Why would you want to evict more
than needed? So you could fetch it again?

Prefetching can be more aggressive, but we actually see busy systems
disabling prefetch to improve interactive performance. Queuing theory
works, once again.

> Something that would accelerate the warming of a cold pool of storage or
> be more aggressive in adding/removing cached data on a volatile dataset
> (e.g. where virtual machines are turned on/off frequently). I have heard
> that some of these defaults might be changed in some future release of
> Illumos, but haven't seen any specifics saying that the idea is nearing
> fruition in release XYZ.

It is easy to warm data (dd), even to put it into MRU (dd + dd). For best
performance with VMs, MRU works extremely well, especially with clones.

There are plenty of good ideas being kicked around here, but remember that
to support things like QoS at the application level, the applications must
be written to an interface that passes QoS hints all the way down the
stack. Lacking these interfaces means that QoS needs to be managed by
hand... and that management effort must be worth the effort.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
bug fix below...

On Dec 5, 2012, at 1:10 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Dec 5, 2012, at 7:46 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:
>> Something that would accelerate the warming of a cold pool of storage
>> or be more aggressive in adding/removing cached data on a volatile
>> dataset (e.g. where virtual machines are turned on/off frequently). I
>> have heard that some of these defaults might be changed in some future
>> release of Illumos, but haven't seen any specifics saying that the idea
>> is nearing fruition in release XYZ.
>
> It is easy to warm data (dd), even to put it into MRU (dd + dd). For
> best performance with VMs, MRU works extremely well, especially with
> clones.

Should read:

It is easy to warm data (dd), even to put it into MFU (dd + dd). For best
performance with VMs, MFU works extremely well, especially with clones.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
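A minimal sketch of the "dd + dd" warming mentioned above (the file path
is made up); the second pass is what promotes the blocks from the MRU list
to MFU:

  # first sequential read pulls the blocks into the ARC (MRU)
  dd if=/tank/vm/guest01.img of=/dev/null bs=1M
  # reading the same blocks again promotes them to MFU
  dd if=/tank/vm/guest01.img of=/dev/null bs=1M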
> At present, I do not see async write QoS as being interesting. That
> leaves sync writes and reads as the managed I/O. Unfortunately, with
> HDDs, the variance in response time >> queue management time, so the
> results are less useful than in the case of SSDs. Control theory works,
> once again. For sync writes, they are often latency-sensitive and thus
> have the highest priority. Reads have lower priority, with prefetch
> reads at lower priority still.

This makes sense for the most part, and I agree that with spinning HDDs
there might be minimal benefit. It is why I suggested that ARC/L2ARC might
be the reasonable starting place for an idea like this, because the
latencies are orders of magnitude lower. Perhaps I'm looking for a way to
modify the prefetch to have a higher priority when the system is under
some threshold.

>> On a related note (maybe?) I would love to see pool-wide settings that
>> control how aggressively data is added/removed from the ARC, L2ARC,
>> etc.
>
> Evictions are done on an as-needed basis. Why would you want to evict
> more than needed? So you could fetch it again?
>
> Prefetching can be more aggressive, but we actually see busy systems
> disabling prefetch to improve interactive performance. Queuing theory
> works, once again.

It's not that I want evictions to occur for no reason... only that the
rate be accelerated if there is contention. If I recall correctly, ZFS has
some default values that throttle how quickly the ARC/L2ARC are updated,
and the explanation I read was that SSDs 6+ years ago were not capable of
the IOPS and throughput that they are today.

I know that ZFS has a prefetch capability but have seen fairly little
written about it. Are there any good references you can point me to, to
better understand it? In particular I would like to see some kind of
measurement on my systems showing how often this capability is utilized.

>> Something that would accelerate the warming of a cold pool of storage
>> or be more aggressive in adding/removing cached data on a volatile
>> dataset (e.g. where virtual machines are turned on/off frequently). I
>> have heard that some of these defaults might be changed in some future
>> release of Illumos, but haven't seen any specifics saying that the
>> idea is nearing fruition in release XYZ.
>
> It is easy to warm data (dd), even to put it into MFU (dd + dd). For
> best performance with VMs, MFU works extremely well, especially with
> clones.

I'm unclear on the best way to warm data... do you mean to simply `dd
if=/volumes/myvol/data of=/dev/null`? I have always been under the
impression that the ARC/L2ARC has rate limiting on how much data can be
added to the cache per interval (I can't remember the interval). Is this
not the case? If there is some rate limiting in place, dd-ing the data
like my example above would not necessarily cache all of the data... it
might take several iterations to populate the cache, correct?

Forgive my naivete, but when I look at my pool while it is under random
load and see a heavy load hitting the spinning disk vdevs and relatively
little on my L2ARC SSDs, I wonder how to better utilize their performance.
I would think that if my L2ARC is not yet full and it has very low
IOPS/throughput/busy/wait, then ZFS should use that opportunity to
populate the cache aggressively based on the MRU or some other mechanism.

Sorry to digress from the original thread!
> I'm unclear on the best way to warm data... do you mean to simply `dd
> if=/volumes/myvol/data of=/dev/null`? I have always been under the
> impression that the ARC/L2ARC has rate limiting on how much data can be
> added to the cache per interval (I can't remember the interval). Is this
> not the case? If there is some rate limiting in place, dd-ing the data
> like my example above would not necessarily cache all of the data... it
> might take several iterations to populate the cache, correct?

Quick update... I found at least one reference to the rate limiting I was
referring to. It was Richard from ~2.5 years ago :)
http://marc.info/?l=zfs-discuss&m=127060523611023&w=2

I assume the source code reference is still valid, in which case a
population rate of 8MB per second into L2ARC is extremely slow in my books
and very conservative... It would take a very long time to warm the
hundreds of gigs of VMs we have into cache. Perhaps the L2ARC_WRITE_BOOST
tunable might be a good place to aggressively warm a cache, but my
preference is to not touch the tunables if I have a choice. I'd rather the
system default be updated to reflect modern hardware; that way everyone
benefits and I'm not running some custom build.
On Dec 6, 2012, at 5:30 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:

>> I'm unclear on the best way to warm data... do you mean to simply `dd
>> if=/volumes/myvol/data of=/dev/null`? I have always been under the
>> impression that the ARC/L2ARC has rate limiting on how much data can be
>> added to the cache per interval (I can't remember the interval). Is
>> this not the case? If there is some rate limiting in place, dd-ing the
>> data like my example above would not necessarily cache all of the
>> data... it might take several iterations to populate the cache,
>> correct?
>
> Quick update... I found at least one reference to the rate limiting I
> was referring to. It was Richard from ~2.5 years ago :)
> http://marc.info/?l=zfs-discuss&m=127060523611023&w=2
>
> I assume the source code reference is still valid, in which case a
> population rate of 8MB per second into L2ARC is extremely slow in my
> books and very conservative... It would take a very long time to warm
> the hundreds of gigs of VMs we have into cache. Perhaps the
> L2ARC_WRITE_BOOST tunable might be a good place to aggressively warm a
> cache, but my preference is to not touch the tunables if I have a
> choice. I'd rather the system default be updated to reflect modern
> hardware; that way everyone benefits and I'm not running some custom
> build.

Yep, the default L2ARC fill rate is quite low for modern systems. It is
not uncommon to see it increased significantly, with the corresponding
improvements in hit rate for busy systems. Can you file an RFE at
https://www.illumos.org/projects/illumos-gate/issues/

Thanks!
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
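For anyone who wants to experiment before such a default change lands, the
tunables in question are l2arc_write_max and l2arc_write_boost; a sketch,
assuming an illumos-style /etc/system and mdb (the values are examples
only, not recommendations):

  # /etc/system -- raise the steady-state L2ARC fill rate to 64 MB/s
  # and the cold-cache "boost" rate to 128 MB/s; needs a reboot
  set zfs:l2arc_write_max = 0x4000000
  set zfs:l2arc_write_boost = 0x8000000

  # or change a live kernel with mdb (64-bit variables, hence /Z)
  echo "l2arc_write_max/Z 0x4000000" | mdb -kw
  echo "l2arc_write_boost/Z 0x8000000" | mdb -kw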