I've heard a claim that ZFS relies too much on RAM caching, but implements
no sort of priorities (indeed, I've seen no knobs to tune those) - so that
if the storage box receives many different types of IO requests with
different "administrative weights" in the view of admins, it cannot really
throttle some IOs to boost others when such IOs have to hit the pool's
spindles.

For example, I might want to have corporate webshop-related databases and
appservers be the fastest storage citizens, then some corporate CRM and
email, then various lower-priority zones and VMs, and at the bottom of the
list - backups.

AFAIK, now such requests would hit the ARC, then the disks if needed - in
no particular order. Well, can the order be made "particular" with the
current ZFS architecture, i.e. by setting some datasets to have a certain
NICEness or another priority mechanism?

Thanks for info/ideas,
//Jim
On Thu, 29 Nov 2012, Jim Klimov wrote:

> I've heard a claim that ZFS relies too much on RAM caching, but
> implements no sort of priorities (indeed, I've seen no knobs to
> tune those) - so that if the storage box receives many different
> types of IO requests with different "administrative weights" in
> the view of admins, it cannot really throttle some IOs to boost
> others when such IOs have to hit the pool's spindles.
>
> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.
>
> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

QoS poses a problem. Zfs needs to write a transaction group at a time.
During part of the TXG write cycle, zfs does not return any data. Zfs
writes TXGs quite hard so they fill the I/O channel. Even if one orders
the reads during the TXG write cycle, zfs will not return any data for
part of the time.

There are really only a few solutions when resources might be limited:

  1. Use fewer resources
  2. Use resources more wisely
  3. Add more resources until the problem goes away

I think that current zfs strives for #1 and QoS is option #2. Quite often,
option #3 is effective because problems just go away once enough resources
are available.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
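For reference, the TXG sync behaviour Bob describes is governed on
illumos-derived kernels by the zfs_txg_timeout tunable; a minimal sketch of
adjusting it, assuming an illumos-style /etc/system and mdb (the value of
5 seconds is only an example, not a recommendation):

  # /etc/system -- shorten the interval between TXG syncs so each
  # write burst is smaller (value in seconds; needs a reboot)
  set zfs:zfs_txg_timeout = 5

  # or poke a live kernel with mdb (takes effect immediately)
  echo "zfs_txg_timeout/W 0t5" | mdb -kw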
On 11/29/12 10:56 AM, Jim Klimov wrote:

> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.
>
> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

Something like that is implemented in Joyent's Illumos-based distribution,
SmartOS. (Illumos is the open-source continuation of the OpenSolaris
OS/Net, just as Solaris 11 is the closed one.) Following their work it is
also implemented in OpenIndiana/Illumos, and possibly others. A list of
Illumos-based distributions:
http://wiki.illumos.org/display/illumos/Distributions

It uses Solaris Zones and throttles their disk usage at that level, so you
separate workload processes into separate zones. Or you can even put KVM
machines under the zones (Joyent and OI support the Joyent-written
KVM/Intel implementation in Illumos) for the same reason of I/O throttling.

They (Joyent) say that their solution is not much code, but gives very
good results (they run a massive cloud computing service with many zones
and KVM VMs, so they should know).
http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/

I don't know whether it is available/applicable to the (now) closed OS/Net
of Solaris 11 and Solaris 10, because Joyent/Illumos have access to the
complete stack and are actively changing it to suit their needs - a good
example of the benefits of an open source/free software stack. But maybe
it is.
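The throttle described above is driven by a per-zone priority setting; a
rough sketch of how it is typically adjusted, assuming a SmartOS build with
vmadm and the zone.zfs-io-priority resource control from the Joyent
throttle (the UUID and zone name are placeholders, and exact property names
may vary between builds):

  # SmartOS: give a zone/VM a larger share of pool I/O (default 100)
  vmadm update <uuid> zfs_io_priority=200

  # plain illumos zones: the same knob is exposed as a zone rctl,
  # adjustable on a running zone with prctl
  prctl -n zone.zfs-io-priority -r -v 200 -i zone webshop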
On 12/ 2/12 03:24 AM, Nikola M. wrote:

> It uses Solaris Zones and throttles their disk usage at that level, so
> you separate workload processes into separate zones. Or you can even put
> KVM machines under the zones (Joyent and OI support the Joyent-written
> KVM/Intel implementation in Illumos) for the same reason of I/O
> throttling.
>
> They (Joyent) say that their solution is not much code, but gives very
> good results (they run a massive cloud computing service with many zones
> and KVM VMs, so they should know).
> http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
> http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/

There is a short video, from the 16th minute onward, from the BayLISA
meetup at Joyent, August 16, 2012:
https://www.youtube.com/watch?v=6csFi0D5eGY
It talks about the ZFS throttle implementation architecture in Illumos,
from Joyent's SmartOS.

I learned it is also available in the Entic.net-sponsored OpenIndiana and
probably in Nexenta too, since it is implemented inside Illumos.

N.
On Dec 1, 2012, at 6:54 PM, "Nikola M." <minikola at gmail.com> wrote:

> On 12/ 2/12 03:24 AM, Nikola M. wrote:
>> It uses Solaris Zones and throttles their disk usage at that level, so
>> you separate workload processes into separate zones. Or you can even
>> put KVM machines under the zones (Joyent and OI support the
>> Joyent-written KVM/Intel implementation in Illumos) for the same reason
>> of I/O throttling.
>>
>> They (Joyent) say that their solution is not much code, but gives very
>> good results (they run a massive cloud computing service with many
>> zones and KVM VMs, so they should know).
>> http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
>> http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/
>
> There is a short video, from the 16th minute onward, from the BayLISA
> meetup at Joyent, August 16, 2012:
> https://www.youtube.com/watch?v=6csFi0D5eGY
> It talks about the ZFS throttle implementation architecture in Illumos,
> from Joyent's SmartOS.

There was a good presentation on this at the OpenStorage Summit in 2011.
Look for it on youtube.

> I learned it is also available in the Entic.net-sponsored OpenIndiana
> and probably in Nexenta too, since it is implemented inside Illumos.

NexentaStor 3.x is not an illumos-based distribution; it is based on
OpenSolaris b134.
 -- richard
On 12/ 2/12 05:19 AM, Richard Elling wrote:

> On Dec 1, 2012, at 6:54 PM, "Nikola M." <minikola at gmail.com> wrote:
>> There is a short video, from the 16th minute onward, from the BayLISA
>> meetup at Joyent, August 16, 2012:
>> https://www.youtube.com/watch?v=6csFi0D5eGY
>> It talks about the ZFS throttle implementation architecture in Illumos,
>> from Joyent's SmartOS.
>
> There was a good presentation on this at the OpenStorage Summit in 2011.
> Look for it on youtube.
>
>> I learned it is also available in the Entic.net-sponsored OpenIndiana
>> and probably in Nexenta too, since it is implemented inside Illumos.
>
> NexentaStor 3.x is not an illumos-based distribution; it is based on
> OpenSolaris b134.

Oh yes, but I had Nexenta in general in mind, where the NexentaStor
community edition is based on Illumos. GDAmore (the Illumos founder) is
from Nexenta, after all. It is good that one can get support/storage from
Nexenta, and it is a living thing - developing, with a future, etc.

And looking at the OpenStorage Summit, I forgot to mention Delphix, which
also has developers formerly from Sun and sells software appliances.

The last info I got about Illumos is that these kinds of enhancements do
not automatically go upstream into Illumos; it is up to the distributions
to choose what to include.

And yes, there are summits:
http://www.nexenta.com/corp/nexenta-tv/openstorage-summit
http://www.openstoragesummit.org/emea/index.html
On Nov 29, 2012, at 1:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> I've heard a claim that ZFS relies too much on RAM caching, but
> implements no sort of priorities (indeed, I've seen no knobs to
> tune those) - so that if the storage box receives many different
> types of IO requests with different "administrative weights" in
> the view of admins, it cannot really throttle some IOs to boost
> others when such IOs have to hit the pool's spindles.

Caching has nothing to do with QoS in this context. *All* modern
filesystems cache to RAM, otherwise they are unusable.

> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.

Please read the papers on the ARC and how it deals with MFU and MRU cache
types. You can adjust these policies using the primarycache and
secondarycache properties at the dataset level.

> AFAIK, now such requests would hit the ARC, then the disks if
> needed - in no particular order. Well, can the order be made
> "particular" with the current ZFS architecture, i.e. by setting
> some datasets to have a certain NICEness or another priority
> mechanism?

ZFS has a priority-based I/O scheduler that works at the DMU level.
However, there is no system call interface in UNIX that transfers priority
or QoS information (e.g. read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
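The primarycache and secondarycache policies mentioned here are ordinary
per-dataset properties; a small example of pushing a low-priority dataset
out of the caches (the pool/dataset names are made up):

  # keep only metadata in the ARC for a bulk/backup dataset,
  # and keep its blocks out of the L2ARC entirely
  zfs set primarycache=metadata tank/backups
  zfs set secondarycache=none tank/backups

  # verify the settings
  zfs get primarycache,secondarycache tank/backups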
On 2012-12-05 04:11, Richard Elling wrote:
> On Nov 29, 2012, at 1:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> I've heard a claim that ZFS relies too much on RAM caching, but
>> implements no sort of priorities (indeed, I've seen no knobs to
>> tune those) - so that if the storage box receives many different
>> types of IO requests with different "administrative weights" in
>> the view of admins, it cannot really throttle some IOs to boost
>> others when such IOs have to hit the pool's spindles.
>
> Caching has nothing to do with QoS in this context. *All* modern
> filesystems cache to RAM, otherwise they are unusable.

Yes, I get that. However, many systems get away with less RAM than
recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as the
starting option), and make their compromises elsewhere. They have to
anyway, and they get different results, perhaps even better suited to
certain narrow or big niches.

Whatever the aggregate result, this difference does lead to some differing
features that The Others' marketing trumpets praise as the advantage :) -
like the ability to mark some IO traffic as of higher priority than other
traffic, in one case (which is now also an Oracle product line,
apparently)...

Actually, this question stems from a discussion at a seminar I've recently
attended - which praised ZFS but pointed out its weaknesses against some
other players on the market, so we are not unaware of those.

>> For example, I might want to have corporate webshop-related
>> databases and appservers be the fastest storage citizens, then
>> some corporate CRM and email, then various lower-priority zones
>> and VMs, and at the bottom of the list - backups.
>
> Please read the papers on the ARC and how it deals with MFU and
> MRU cache types. You can adjust these policies using the primarycache
> and secondarycache properties at the dataset level.

I've read about that, and don't exactly see how much these help if there
is pressure on RAM so that cache entries expire... Meaning, if I want
certain datasets to remain cached as long as possible (i.e. serve a
website or DB from RAM, not HDD), at the expense of other datasets that
might see higher usage but have lower business priority - how do I do
that? Or, perhaps, add (L2)ARC shares, reservations and/or quotas concepts
to the certain datasets which I explicitly want to throttle up or down?

At most, now I can mark the lower-priority datasets' data or even metadata
as not cached in ARC or L2ARC. On-off. There seem to be no smaller steps,
like QoS tags [0-7] or something like that.

BTW, as a short side question: is it a true or false statement that if I
set primarycache=metadata, then the ZFS ARC won't cache any "userdata" and
thus it won't appear in (expire into) L2ARC? So the real setting is that I
can cache data+meta in RAM, and only meta in SSD? Not the other way around
(meta in RAM but both data+meta in SSD)?

>> AFAIK, now such requests would hit the ARC, then the disks if
>> needed - in no particular order. Well, can the order be made
>> "particular" with the current ZFS architecture, i.e. by setting
>> some datasets to have a certain NICEness or another priority
>> mechanism?
>
> ZFS has a priority-based I/O scheduler that works at the DMU level.
> However, there is no system call interface in UNIX that transfers
> priority or QoS information (e.g. read() or write()) into the file
> system VFS interface. So the granularity of priority control is by zone
> or dataset.

I do not think I've seen mention of priority controls per dataset, at
least not in generic ZFS. Actually, that was part of my question above.
And while throttling or resource shares between higher-level software
components (zones, VMs) might have a similar effect, this is not something
really controlled and enforced by the storage layer.

> -- richard

Thanks,
//Jim
On 2012-11-29 10:56, Jim Klimov wrote:
> For example, I might want to have corporate webshop-related
> databases and appservers be the fastest storage citizens, then
> some corporate CRM and email, then various lower-priority zones
> and VMs, and at the bottom of the list - backups.

On a side note, I'm now revisiting old ZFS presentations collected over
the years, and one suggested as "TBD" statements the idea that metaslabs
with varying speeds could be used for specific tasks, and not only to
receive the allocations first so that a new pool would perform quickly.
I.e. "TBD: Workload specific freespace selection policies".

Say, I create a new storage box and lay out some bulk file, backup and
database datasets. Even as they are receiving their first bytes, I have
some idea about the kind of performance I'd expect from them - with QoS
per dataset I might destine the databases to the fast LBAs (and smaller
seeks between tracks I expect to use frequently), and the bulk data onto
slower tracks right from the start, and the rest of unspecified data would
grow around the middle of the allocation range.

These types of data would then only "creep" onto the less fitting
metaslabs (faster for bulk, slower for DB) if the target ones run out of
free space. Then the next-best-fitting would be used...

This one idea is somewhat reminiscent of hierarchical storage management,
except that it is about static allocation at write time and takes place
within a single disk (or set of similar disks), in order to warrant
different performance for different tasks.

///Jim
I don't have anything significant to add to this conversation, but wanted
to chime in that I also find the concept of a QoS-like capability very
appealing and that Jim's recent emails resonate with me. You're not alone!
I believe there are many use cases where a granular prioritization that
controls how the ARC, L2ARC, ZIL and underlying vdevs are used to give
priority IO to a specific zvol, share, etc. would be useful. My experience
is stronger on the networking side and I envision a weighted class-based
queuing methodology (or something along those lines). I recognize that
ZFS's architectural preference for coalescing writes and reads into larger
sequential batches might conflict with a QoS-like capability... Perhaps
the ARC/L2ARC tuning might be a good starting point towards that end?

On a related note (maybe?) I would love to see pool-wide settings that
control how aggressively data is added/removed from the ARC, L2ARC, etc.
Something that would accelerate the warming of a cold pool of storage or
be more aggressive in adding/removing cached data on a volatile dataset
(e.g. where virtual machines are turned on/off frequently). I have heard
that some of these defaults might be changed in some future release of
Illumos, but haven't seen any specifics saying that the idea is nearing
fruition in release XYZ.

Matt

On Wed, Dec 5, 2012 at 10:26 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2012-11-29 10:56, Jim Klimov wrote:
>> For example, I might want to have corporate webshop-related
>> databases and appservers be the fastest storage citizens, then
>> some corporate CRM and email, then various lower-priority zones
>> and VMs, and at the bottom of the list - backups.
>
> On a side note, I'm now revisiting old ZFS presentations collected
> over the years, and one suggested as "TBD" statements the idea
> that metaslabs with varying speeds could be used for specific
> tasks, and not only to receive the allocations first so that a new
> pool would perform quickly. I.e. "TBD: Workload specific freespace
> selection policies".
>
> Say, I create a new storage box and lay out some bulk file, backup
> and database datasets. Even as they are receiving their first bytes,
> I have some idea about the kind of performance I'd expect from them -
> with QoS per dataset I might destine the databases to the fast LBAs
> (and smaller seeks between tracks I expect to use frequently), and
> the bulk data onto slower tracks right from the start, and the rest
> of unspecified data would grow around the middle of the allocation
> range.
>
> These types of data would then only "creep" onto the less fitting
> metaslabs (faster for bulk, slower for DB) if the target ones run
> out of free space. Then the next-best-fitting would be used...
>
> This one idea is somewhat reminiscent of hierarchical storage
> management, except that it is about static allocation at write time
> and takes place within a single disk (or set of similar disks), in
> order to warrant different performance for different tasks.
>
> ///Jim
On Dec 5, 2012, at 5:41 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2012-12-05 04:11, Richard Elling wrote:
>> Caching has nothing to do with QoS in this context. *All* modern
>> filesystems cache to RAM, otherwise they are unusable.
>
> Yes, I get that. However, many systems get away with less RAM than
> recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as
> the starting option), and make their compromises elsewhere. They have
> to anyway, and they get different results, perhaps even better suited
> to certain narrow or big niches.

This is nothing more than a specious argument. They have small caches, so
their performance is not as good as those with larger caches. This is like
saying you need a smaller CPU cache because larger CPU caches get full.

> Whatever the aggregate result, this difference does lead to some
> differing features that The Others' marketing trumpets praise as the
> advantage :) - like the ability to mark some IO traffic as of higher
> priority than other traffic, in one case (which is now also an Oracle
> product line, apparently)...
>
> Actually, this question stems from a discussion at a seminar I've
> recently attended - which praised ZFS but pointed out its weaknesses
> against some other players on the market, so we are not unaware of
> those.
>
>> Please read the papers on the ARC and how it deals with MFU and
>> MRU cache types. You can adjust these policies using the primarycache
>> and secondarycache properties at the dataset level.
>
> I've read about that, and don't exactly see how much these help if
> there is pressure on RAM so that cache entries expire... Meaning, if I
> want certain datasets to remain cached as long as possible (i.e. serve
> a website or DB from RAM, not HDD), at the expense of other datasets
> that might see higher usage but have lower business priority - how do
> I do that? Or, perhaps, add (L2)ARC shares, reservations and/or quotas
> concepts to the certain datasets which I explicitly want to throttle
> up or down?

MRU evictions take precedence over MFU evictions. If the data is not in
MFU, then it is, by definition, not being frequently used.

> At most, now I can mark the lower-priority datasets' data or even
> metadata as not cached in ARC or L2ARC. On-off. There seem to be no
> smaller steps, like QoS tags [0-7] or something like that.
>
> BTW, as a short side question: is it a true or false statement that if
> I set primarycache=metadata, then the ZFS ARC won't cache any
> "userdata" and thus it won't appear in (expire into) L2ARC? So the
> real setting is that I can cache data+meta in RAM, and only meta in
> SSD? Not the other way around (meta in RAM but both data+meta in SSD)?

That is correct, by my reading of the code.

>> ZFS has a priority-based I/O scheduler that works at the DMU level.
>> However, there is no system call interface in UNIX that transfers
>> priority or QoS information (e.g. read() or write()) into the file
>> system VFS interface. So the granularity of priority control is by
>> zone or dataset.
>
> I do not think I've seen mention of priority controls per dataset, at
> least not in generic ZFS. Actually, that was part of my question above.
> And while throttling or resource shares between higher-level software
> components (zones, VMs) might have a similar effect, this is not
> something really controlled and enforced by the storage layer.

The priority scheduler is by type of I/O request. For example, sync
requests have priority over async requests. Reads and writes have priority
over scrubbing, etc. The inter-dataset scheduling is done at the zone
level. There is more work being done in this area, but it is still in the
research phase.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
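To make the combination just confirmed concrete, and to get a rough view
of MRU vs. MFU activity over time, something along these lines (the
dataset name is made up):

  # data+metadata cached in RAM, but only metadata spilling into L2ARC
  zfs set primarycache=all tank/db
  zfs set secondarycache=metadata tank/db

  # ARC kstats show how MRU and MFU hits develop
  kstat -p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits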
On Dec 5, 2012, at 7:46 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:

> I don't have anything significant to add to this conversation, but
> wanted to chime in that I also find the concept of a QoS-like capability
> very appealing and that Jim's recent emails resonate with me. You're not
> alone! I believe there are many use cases where a granular
> prioritization that controls how the ARC, L2ARC, ZIL and underlying
> vdevs are used to give priority IO to a specific zvol, share, etc. would
> be useful. My experience is stronger on the networking side and I
> envision a weighted class-based queuing methodology (or something along
> those lines). I recognize that ZFS's architectural preference for
> coalescing writes and reads into larger sequential batches might
> conflict with a QoS-like capability... Perhaps the ARC/L2ARC tuning
> might be a good starting point towards that end?

At present, I do not see async write QoS as being interesting. That leaves
sync writes and reads as the managed I/O. Unfortunately, with HDDs, the
variance in response time >> queue management time, so the results are
less useful than in the case of SSDs. Control theory works, once again.
For sync writes, they are often latency-sensitive and thus have the
highest priority. Reads have lower priority, with prefetch reads at lower
priority still.

> On a related note (maybe?) I would love to see pool-wide settings that
> control how aggressively data is added/removed from the ARC, L2ARC, etc.

Evictions are done on an as-needed basis. Why would you want to evict more
than needed? So you could fetch it again?

Prefetching can be more aggressive, but we actually see busy systems
disabling prefetch to improve interactive performance. Queuing theory
works, once again.

> Something that would accelerate the warming of a cold pool of storage or
> be more aggressive in adding/removing cached data on a volatile dataset
> (e.g. where virtual machines are turned on/off frequently). I have heard
> that some of these defaults might be changed in some future release of
> Illumos, but haven't seen any specifics saying that the idea is nearing
> fruition in release XYZ.

It is easy to warm data (dd), even to put it into MRU (dd + dd). For best
performance with VMs, MRU works extremely well, especially with clones.

There are plenty of good ideas being kicked around here, but remember that
to support things like QoS at the application level, the applications must
be written to an interface that passes QoS hints all the way down the
stack. Lacking these interfaces means that QoS needs to be managed by
hand... and that management effort must be worth the effort.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
bug fix below...

On Dec 5, 2012, at 1:10 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Dec 5, 2012, at 7:46 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:
>> Something that would accelerate the warming of a cold pool of storage
>> or be more aggressive in adding/removing cached data on a volatile
>> dataset (e.g. where virtual machines are turned on/off frequently). I
>> have heard that some of these defaults might be changed in some future
>> release of Illumos, but haven't seen any specifics saying that the idea
>> is nearing fruition in release XYZ.
>
> It is easy to warm data (dd), even to put it into MRU (dd + dd). For
> best performance with VMs, MRU works extremely well, especially with
> clones.

Should read:

It is easy to warm data (dd), even to put it into MFU (dd + dd). For best
performance with VMs, MFU works extremely well, especially with clones.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
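A minimal sketch of the "dd + dd" warming mentioned above (the file path
is made up); the second pass is what promotes the blocks from the MRU list
to MFU:

  # first sequential read pulls the blocks into the ARC (MRU)
  dd if=/tank/vm/guest01.img of=/dev/null bs=1M
  # reading the same blocks again promotes them to MFU
  dd if=/tank/vm/guest01.img of=/dev/null bs=1M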
> At present, I do not see async write QoS as being interesting. That
> leaves sync writes and reads as the managed I/O. Unfortunately, with
> HDDs, the variance in response time >> queue management time, so the
> results are less useful than in the case of SSDs. Control theory works,
> once again. For sync writes, they are often latency-sensitive and thus
> have the highest priority. Reads have lower priority, with prefetch
> reads at lower priority still.

This makes sense for the most part, and I agree that with spinning HDDs
there might be minimal benefit. It is why I suggested that ARC/L2ARC might
be the reasonable starting place for an idea like this, because the
latencies are orders of magnitude lower. Perhaps I'm looking for a way to
modify the prefetch to have a higher priority when the system is under
some threshold.

>> On a related note (maybe?) I would love to see pool-wide settings that
>> control how aggressively data is added/removed from the ARC, L2ARC,
>> etc.
>
> Evictions are done on an as-needed basis. Why would you want to evict
> more than needed? So you could fetch it again?
>
> Prefetching can be more aggressive, but we actually see busy systems
> disabling prefetch to improve interactive performance. Queuing theory
> works, once again.

It's not that I want evictions to occur for no reason... only that the
rate be accelerated if there is contention. If I recall correctly, ZFS has
some default values that throttle how quickly the ARC/L2ARC are updated,
and the explanation I read was that SSDs 6+ years ago were not capable of
the IOPS and throughput that they are today.

I know that ZFS has a prefetch capability but have seen fairly little
written about it. Are there any good references you can point me to, to
better understand it? In particular I would like to see some kind of
measurement on my systems showing how often this capability is utilized.

>> Something that would accelerate the warming of a cold pool of storage
>> or be more aggressive in adding/removing cached data on a volatile
>> dataset (e.g. where virtual machines are turned on/off frequently). I
>> have heard that some of these defaults might be changed in some future
>> release of Illumos, but haven't seen any specifics saying that the
>> idea is nearing fruition in release XYZ.
>
> It is easy to warm data (dd), even to put it into MFU (dd + dd). For
> best performance with VMs, MFU works extremely well, especially with
> clones.

I'm unclear on the best way to warm data... do you mean to simply `dd
if=/volumes/myvol/data of=/dev/null`? I have always been under the
impression that the ARC/L2ARC has rate limiting on how much data can be
added to the cache per interval (I can't remember the interval). Is this
not the case? If there is some rate limiting in place, dd-ing the data
like my example above would not necessarily cache all of the data... it
might take several iterations to populate the cache, correct?

Forgive my naivete, but when I look at my pool while it is under random
load and see a heavy load hitting the spinning disk vdevs and relatively
little on my L2ARC SSDs, I wonder how to better utilize their performance.
I would think that if my L2ARC is not yet full and it has very low
IOPS/throughput/busy/wait, then ZFS should use that opportunity to
populate the cache aggressively based on the MRU or some other mechanism.

Sorry to digress from the original thread!
> I'm unclear on the best way to warm data... do you mean to simply `dd
> if=/volumes/myvol/data of=/dev/null`? I have always been under the
> impression that the ARC/L2ARC has rate limiting on how much data can be
> added to the cache per interval (I can't remember the interval). Is this
> not the case? If there is some rate limiting in place, dd-ing the data
> like my example above would not necessarily cache all of the data... it
> might take several iterations to populate the cache, correct?

Quick update... I found at least one reference to the rate limiting I was
referring to. It was Richard from ~2.5 years ago :)
http://marc.info/?l=zfs-discuss&m=127060523611023&w=2

I assume the source code reference is still valid, in which case a
population rate of 8MB per second into L2ARC is extremely slow in my books
and very conservative... It would take a very long time to warm the
hundreds of gigs of VMs we have into cache. Perhaps the L2ARC_WRITE_BOOST
tunable might be a good place to aggressively warm a cache, but my
preference is to not touch the tunables if I have a choice. I'd rather the
system default be updated to reflect modern hardware; that way everyone
benefits and I'm not running some custom build.
On Dec 6, 2012, at 5:30 AM, Matt Van Mater <matt.vanmater at gmail.com> wrote:

>> I'm unclear on the best way to warm data... do you mean to simply `dd
>> if=/volumes/myvol/data of=/dev/null`? I have always been under the
>> impression that the ARC/L2ARC has rate limiting on how much data can be
>> added to the cache per interval (I can't remember the interval). Is
>> this not the case? If there is some rate limiting in place, dd-ing the
>> data like my example above would not necessarily cache all of the
>> data... it might take several iterations to populate the cache,
>> correct?
>
> Quick update... I found at least one reference to the rate limiting I
> was referring to. It was Richard from ~2.5 years ago :)
> http://marc.info/?l=zfs-discuss&m=127060523611023&w=2
>
> I assume the source code reference is still valid, in which case a
> population rate of 8MB per second into L2ARC is extremely slow in my
> books and very conservative... It would take a very long time to warm
> the hundreds of gigs of VMs we have into cache. Perhaps the
> L2ARC_WRITE_BOOST tunable might be a good place to aggressively warm a
> cache, but my preference is to not touch the tunables if I have a
> choice. I'd rather the system default be updated to reflect modern
> hardware; that way everyone benefits and I'm not running some custom
> build.

Yep, the default L2ARC fill rate is quite low for modern systems. It is
not uncommon to see it increased significantly, with the corresponding
improvements in hit rate for busy systems. Can you file an RFE at
https://www.illumos.org/projects/illumos-gate/issues/

Thanks!
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
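For anyone who wants to experiment before such a default change lands, the
tunables in question are l2arc_write_max and l2arc_write_boost; a sketch,
assuming an illumos-style /etc/system and mdb (the values are examples
only, not recommendations):

  # /etc/system -- raise the steady-state L2ARC fill rate to 64 MB/s
  # and the cold-cache "boost" rate to 128 MB/s; needs a reboot
  set zfs:l2arc_write_max = 0x4000000
  set zfs:l2arc_write_boost = 0x8000000

  # or change a live kernel with mdb (64-bit variables, hence /Z)
  echo "l2arc_write_max/Z 0x4000000" | mdb -kw
  echo "l2arc_write_boost/Z 0x8000000" | mdb -kw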