All the resources allocated based on xen_blkif_reqs are global in
blkback. While (without having measured anything) I think that this
is bad from a QoS perspective (not the least implied from a warning
issued by Citrix'es multi-page-ring patches:

	if (blkif_reqs < BLK_RING_SIZE(order))
		printk(KERN_WARNING "WARNING: "
		       "I/O request space (%d reqs) < ring order %ld, "
		       "consider increasing %s.reqs to >= %ld.",
		       blkif_reqs, order, KBUILD_MODNAME,
		       roundup_pow_of_two(BLK_RING_SIZE(order)));

indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
this to per-instance allocations, as the amount of memory taken
away from Dom0 for this may be not insignificant when there are
many devices.

Does anyone have an opinion here, in particular regarding the
original authors' decision to make this global vs. the apparently
made observation (by Daniel Stodden, the author of said patch,
who I don't have any current email of to ask directly), but also
in the context of multi-page rings, the purpose of which is to
allow for larger amounts of in-flight I/O?

Thanks, Jan
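[For context, the pattern under discussion looks roughly like the
following: a single free list, sized by the blkif_reqs module
parameter, that every backend instance draws from. This is a
simplified sketch modeled on the blkback driver of this era, not the
literal code.]

	/* Simplified sketch of blkback's global pending-request pool;
	 * illustrative only. */
	static int blkif_reqs = 64;	/* module parameter: pool size */

	struct pending_req {
		struct list_head free_list;
		/* ... grant handles, bio bookkeeping, ... */
	};

	static struct pending_req *pending_reqs;	/* one global array */
	static LIST_HEAD(pending_free);
	static DEFINE_SPINLOCK(pending_free_lock);	/* shared by ALL instances */
	static DECLARE_WAIT_QUEUE_HEAD(pending_free_wq);

	/* Every backend instance allocates from here: once blkif_reqs
	 * requests are in flight system-wide, all further I/O waits. */
	static struct pending_req *alloc_req(void)
	{
		struct pending_req *req = NULL;
		unsigned long flags;

		spin_lock_irqsave(&pending_free_lock, flags);
		if (!list_empty(&pending_free)) {
			req = list_first_entry(&pending_free,
					       struct pending_req, free_list);
			list_del(&req->free_list);
		}
		spin_unlock_irqrestore(&pending_free_lock, flags);
		return req;	/* NULL => caller sleeps on pending_free_wq */
	}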
On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> All the resources allocated based on xen_blkif_reqs are global in
> blkback. While (without having measured anything) I think that this
> is bad from a QoS perspective (not the least implied from a warning
> issued by Citrix'es multi-page-ring patches:
>
> 	if (blkif_reqs < BLK_RING_SIZE(order))
> 		printk(KERN_WARNING "WARNING: "
> 		       "I/O request space (%d reqs) < ring order %ld, "
> 		       "consider increasing %s.reqs to >= %ld.",
> 		       blkif_reqs, order, KBUILD_MODNAME,
> 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>
> indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> this to per-instance allocations, as the amount of memory taken
> away from Dom0 for this may be not insignificant when there are
> many devices.
>
> Does anyone have an opinion here, in particular regarding the
> original authors' decision to make this global vs. the apparently
> made observation (by Daniel Stodden, the author of said patch,
> who I don't have any current email of to ask directly), but also
> in the context of multi-page rings, the purpose of which is to
> allow for larger amounts of in-flight I/O?

Not really much to say other than we (well, mostly Wei Liu) have a
similar issue with netback too. That used to have a global pool, then a
pool-per-worker thread. When Wei added thread-per-vif support he solved
this by adding a "page pool" which handles allocations. Possibly this
could grow some sort of fairness etc. knobs and be shared with blkback?

Ian.
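[A shared pool along these lines might expose an interface such as the
one sketched below. This is hypothetical: netback's actual page-pool
API may differ, and the per-instance quota is the kind of fairness
knob being proposed here, not something either driver has today.]

	/* Hypothetical shared page pool with a per-instance fairness cap. */
	struct page_pool {
		spinlock_t lock;
		struct list_head free;	/* free pages, linked via page->lru */
	};

	struct pool_user {		/* one per vif/vbd instance */
		struct page_pool *pool;
		atomic_t in_use;	/* pages this instance currently holds */
		unsigned int quota;	/* fairness knob: per-instance cap */
	};

	static struct page *pool_get_page(struct pool_user *u)
	{
		struct page *pg = NULL;

		if (atomic_read(&u->in_use) >= u->quota)
			return NULL;	/* over fair share: back off */

		spin_lock(&u->pool->lock);
		if (!list_empty(&u->pool->free)) {
			pg = list_first_entry(&u->pool->free, struct page, lru);
			list_del(&pg->lru);
		}
		spin_unlock(&u->pool->lock);

		if (pg)
			atomic_inc(&u->in_use);
		return pg;
	}

	static void pool_put_page(struct pool_user *u, struct page *pg)
	{
		spin_lock(&u->pool->lock);
		list_add(&pg->lru, &u->pool->free);
		spin_unlock(&u->pool->lock);
		atomic_dec(&u->in_use);
	}

[A shrinker callback, as suggested in the next reply, could then
reclaim from the pool's free list under memory pressure.]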
On Mon, Mar 26, 2012 at 05:20:08PM +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix'es multi-page-ring patches:
> >
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WARNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may be not insignificant when there are
> > many devices.
> >
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?

Yes! That is what I was hoping for - as it would hopefully also have
the shrinker API hooked up to combat excessive memory consumption.

> Ian.
On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >
> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>
> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?

I'd definitely welcome an approach in blkback similar to the page pool
changes to netback. Steve Noonan's change to use a spinlock per
blkfront device [1] should show increased contention on the global
request list on the blkback side when multiple devices are attached.

Does netback not also need to handle questions of fairness and
resource starvation?

Matt

[1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9
>>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
>> Cc'ing Daniel for you on this one, Jan.
>>
>> K.
>>
>> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix'es multi-page-ring patches:
>> >
>> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> > 		printk(KERN_WARNING "WARNING: "
>> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> > 		       "consider increasing %s.reqs to >= %ld.",
>> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may be not insignificant when there are
>> > many devices.
>> >
>> > Does anyone have an opinion here, in particular regarding the
>> > original authors' decision to make this global vs. the apparently
>> > made observation (by Daniel Stodden, the author of said patch,
>> > who I don't have any current email of to ask directly), but also
>> > in the context of multi-page rings, the purpose of which is to
>> > allow for larger amounts of in-flight I/O?
>> >
>> > Thanks, Jan
>
> Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> Citrix regarding that matter.
>
> Yes, just allocating a pfn pool per backend instance is way too much
> memory ballooned out. Otherwise this stuff would have never looked the
> way it does now.

This of course could be accounted for by having an initially non-empty
(large enough) balloon (not sure how easy it is these days to do this
for pv-ops, but it has always been trivial with the legacy code). That
wouldn't help a 32-bit kernel much (where generally the initial balloon
is all in highmem, yet the vacated pages need to be in lowmem), but
for 64-bit kernels it should be fine.

> Regarding the right balance, note that on the other extreme end, if PFN
> space were infinite, there's not much expected performance gain from
> rendering virtual backends fully independent. Beyond controller queue
> depth, these requests are all just going to pile up, waiting.

Is there a way to look through the queue stack to find out how many
distinct ones there are that the backend is running on top of, as well
as - for a particular I/O path - the one with the smallest depth? Or can
one assume that the top most one (generally loop's or blktap2's) won't
advertise a queue deeper than what is going to be accepted
downstream (probably not, I'd guess)?

And - what you say would similarly apply to the usefulness of multi-page
rings afaict.

> XenServer has some support for decoupling in blktap.ko [1] which worked
> relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> sysfs object. Name was arbitrary. Size configurable, even at runtime.
>
> Sysfs meant stuff was easily set up by shell or python code, or
> manually. To become operational, every backend must be bound to a pool
> (initially, the global 'default' one, for tool compat). Backends can be
> relinked arbitrarily before entering Connected state.
>
> Then let the userland toolstack set things up according to physical I/O
> topology and properties probed. Basically every physical backend (say, a
> volume group, or a HBA) would start out by allocating and dimensioning a
> dedicated pool (named after the backend), and every backend instance
> fired up gets bound to the pool it belongs to.

Having userland do all that seems like a fallback solution only to me - I
would hope that sufficient information is available directly to the drivers.

Thanks in any case for responding so quickly,
Jan

> There's a lot of additional optimizations one could consider, e.g.
> autogrowing the pool (log(nbackends) or so?) and the like. To improve
> locality, having backends which look ahead in their request queue and
> allocate whole batches is probably a good idea too, etc, etc.
>
> HTH,
> Daniel
>
> [1]
> http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> Note that these are based on mempools, not the frame pools blkback
> would take.
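[To make the scheme Daniel describes concrete, a rough sketch of such
named, runtime-sizable pools with per-instance binding follows. The
names are invented for illustration; this is not the actual blktap
code.]

	/* Illustrative sketch of sysfs-exposed frame pools, loosely
	 * modeled on the blktap design described above. */
	struct backend_info;		/* one per vbd instance */

	struct frame_pool {
		struct kobject kobj;	/* shows up under sysfs */
		struct list_head free_pages;
		unsigned int size;	/* writable attribute, even at runtime */
		struct mutex lock;
	};

	/* Toolstack flow: create one pool per physical backend (volume
	 * group, HBA, ...), size it to the device's queue depth, then
	 * bind each backend instance to it before it connects. */
	struct frame_pool *frame_pool_create(const char *name, unsigned int size);
	int backend_bind_pool(struct backend_info *be, const char *pool_name);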
On Mon, 2012-03-26 at 21:47 +0100, Matt Wilson wrote:
> On Mon, Mar 26, 2012 at 09:20:08AM -0700, Ian Campbell wrote:
> > On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > >
> > > Does anyone have an opinion here, in particular regarding the
> > > original authors' decision to make this global vs. the apparently
> > > made observation (by Daniel Stodden, the author of said patch,
> > > who I don't have any current email of to ask directly), but also
> > > in the context of multi-page rings, the purpose of which is to
> > > allow for larger amounts of in-flight I/O?
> >
> > Not really much to say other than we (well, mostly Wei Liu) have a
> > similar issue with netback too. That used to have a global pool, then a
> > pool-per-worker thread. When Wei added thread-per-vif support he solved
> > this by adding a "page pool" which handles allocations. Possibly this
> > could grow some sort of fairness etc. knobs and be shared with blkback?
>
> I'd definitely welcome an approach in blkback similar to the page pool
> changes to netback. Steve Noonan's change to use a spinlock per
> blkfront device [1] should show increased contention on the global
> request list on the blkback side when multiple devices are attached.
>
> Does netback not also need to handle questions of fairness and
> resource starvation?

It does. What I was meaning to say was that the page pool could grow
some fairness knobs for netback use (if it is not already fair) and,
separately, that it could be shared with blkback (rather than the
fairness knobs being related to the sharing with blkback, which,
looking back, is how my original statement could have been read).

Ian.

> Matt
>
> [1] http://git.kernel.org/?p=linux/kernel/git/axboe/linux-block.git;a=commitdiff;h=3467811e26660eb46bc655234573d22d6876d5f9
On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >> K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the apparently
> >> > made observation (by Daniel Stodden, the author of said patch,
> >> > who I don't have any current email of to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory ballooned out. Otherwise this stuff would have never looked the
> > way it does now.
>
> This of course could be accounted for by having an initially non-empty
> (large enough) balloon (not sure how easy it is these days to do this
> for pv-ops, but it has always been trivial with the legacy code). That
> wouldn't help a 32-bit kernel much (where generally the initial balloon
> is all in highmem, yet the vacated pages need to be in lowmem), but
> for 64-bit kernels it should be fine.
>
> > Regarding the right balance, note that on the other extreme end, if PFN
> > space were infinite, there's not much expected performance gain from
> > rendering virtual backends fully independent. Beyond controller queue
> > depth, these requests are all just going to pile up, waiting.
>
> Is there a way to look through the queue stack to find out how many
> distinct ones there are that the backend is running on top of, as well
> as - for a particular I/O path - the one with the smallest depth? Or can
> one assume that the top most one (generally loop's or blktap2's) won't
> advertise a queue deeper than what is going to be accepted
> downstream (probably not, I'd guess)?
>
> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.

The balance is tricky. What I observe so far is that having multi-page
rings doesn't necessarily help improve performance (but it is still
nice to have it in case of future usage). There are other contentions,
which limit the throughput of a single VIF.

> > XenServer has some support for decoupling in blktap.ko [1] which worked
> > relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> > sysfs object. Name was arbitrary. Size configurable, even at runtime.
> >
> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can be
> > relinked arbitrarily before entering Connected state.
> >
> > Then let the userland toolstack set things up according to physical I/O
> > topology and properties probed. Basically every physical backend (say, a
> > volume group, or a HBA) would start out by allocating and dimensioning a
> > dedicated pool (named after the backend), and every backend instance
> > fired up gets bound to the pool it belongs to.
>
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.

I'm tempted to make all information available to drivers, but haven't
reached a conclusion yet. Maybe we should also allow users to
experiment with various configurations for their specific needs?

Wei.

> Thanks in any case for responding so quickly,
> Jan
>
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> > mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> > Note that these are based on mempools, not the frame pools blkback
> > would take.
On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> > All the resources allocated based on xen_blkif_reqs are global in
> > blkback. While (without having measured anything) I think that this
> > is bad from a QoS perspective (not the least implied from a warning
> > issued by Citrix'es multi-page-ring patches:
> >
> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> > 		printk(KERN_WARNING "WARNING: "
> > 		       "I/O request space (%d reqs) < ring order %ld, "
> > 		       "consider increasing %s.reqs to >= %ld.",
> > 		       blkif_reqs, order, KBUILD_MODNAME,
> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >
> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> > this to per-instance allocations, as the amount of memory taken
> > away from Dom0 for this may be not insignificant when there are
> > many devices.
> >

What's your main concern on QoS? Lock contention? Starvation? Or any
other things?

You are right here, we should be careful about how much memory blkback
may consume.

> > Does anyone have an opinion here, in particular regarding the
> > original authors' decision to make this global vs. the apparently
> > made observation (by Daniel Stodden, the author of said patch,
> > who I don't have any current email of to ask directly), but also
> > in the context of multi-page rings, the purpose of which is to
> > allow for larger amounts of in-flight I/O?
>

I don't know the actual bottleneck for blkback because I have never
played with it. But for multi-page ring netback / netfront, a global
pool (in the sense of starvation) may not become a bottleneck if there
are very limited VIFs running. What I observe is that to achieve
3.8Gb/s throughput between two VIFs connected through an internal
bridge, it only consumes ~80 pages in the pool.

Wei.

> Not really much to say other than we (well, mostly Wei Liu) have a
> similar issue with netback too. That used to have a global pool, then a
> pool-per-worker thread. When Wei added thread-per-vif support he solved
> this by adding a "page pool" which handles allocations. Possibly this
> could grow some sort of fairness etc. knobs and be shared with blkback?
>
> Ian.
>>> On 27.03.12 at 11:21, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 08:27 +0100, Jan Beulich wrote:
>> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
>> > Then let the userland toolstack set things up according to physical I/O
>> > topology and properties probed. Basically every physical backend (say, a
>> > volume group, or a HBA) would start out by allocating and dimensioning a
>> > dedicated pool (named after the backend), and every backend instance
>> > fired up gets bound to the pool it belongs to.
>>
>> Having userland do all that seems like a fallback solution only to me - I
>> would hope that sufficient information is available directly to the drivers.
>
> I'm tempted to make all information available to drivers, but haven't
> reached a conclusion yet. Maybe we should also allow users to
> experiment with various configurations for their specific needs?

Oh, yes, user overrides are certainly desirable. I was merely hinting
at it being desirable from my pov to have sensible default behavior.

Jan
>>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> > All the resources allocated based on xen_blkif_reqs are global in
>> > blkback. While (without having measured anything) I think that this
>> > is bad from a QoS perspective (not the least implied from a warning
>> > issued by Citrix'es multi-page-ring patches:
>> >
>> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> > 		printk(KERN_WARNING "WARNING: "
>> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> > 		       "consider increasing %s.reqs to >= %ld.",
>> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >
>> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> > this to per-instance allocations, as the amount of memory taken
>> > away from Dom0 for this may be not insignificant when there are
>> > many devices.
>> >
>
> What's your main concern on QoS? Lock contention? Starvation? Or any
> other things?

However you want to put it. Prior to the multi-page ring patches, we
have 64 pending requests (global) and 32 ring entries. Obviously,
bumping the ring size just to order 1 will already bring the number of
possible in-flight entries per device on par with those in-flight across
all devices. So _if_ someone really determined that a multi-page ring
helps performance, I wonder whether that was with manually
adjusted global pending request values (not said anywhere) or with
just a single frontend (not very close to real world scenarios).

In any case, two guests with heavy I/O clearly have the potential to
hinder each other, even if both get backed by different physical
devices.

Jan
On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> >
> >
> > What's your main concern on QoS? Lock contention? Starvation? Or any
> > other things?
>
> However you want to put it. Prior to the multi-page ring patches, we
> have 64 pending requests (global) and 32 ring entries. Obviously,
> bumping the ring size just to order 1 will already bring the number of
> possible in-flight entries per device on par with those in-flight across
> all devices. So _if_ someone really determined that a multi-page ring
> helps performance, I wonder whether that was with manually
> adjusted global pending request values (not said anywhere) or with
> just a single frontend (not very close to real world scenarios).

Just to be precise, bumping the order to 1 makes the number of ring
entries more than 64.

> In any case, two guests with heavy I/O clearly have the potential to
> hinder each other, even if both get backed by different physical
> devices.

Right. One solution I can think of is that each blk thread holds a
small number of private entries (threshold to be determined), while
blkback maintains a pool for any allocation that goes beyond that
threshold. But this just makes things more and more complex - let's not
over-engineer before we identify the real bottleneck.

Wei.

> Jan
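[For concreteness, the hybrid Wei describes might look like the sketch
below, reusing the names from the earlier pool sketch. The reserve
size is a made-up threshold; as Wei says, whether the complexity is
justified depends on where the bottleneck actually is.]

	/* Hypothetical hybrid: a small per-thread reserve, with the
	 * contended global pool only as overflow. */
	#define PRIVATE_REQS 8	/* per-thread reserve; tuning TBD */

	struct blkif_thread {
		/* only this thread touches its reserve, so no lock needed */
		struct list_head private_free;
	};

	static struct pending_req *hybrid_alloc_req(struct blkif_thread *t)
	{
		struct pending_req *req;

		if (!list_empty(&t->private_free)) {	/* fast path */
			req = list_first_entry(&t->private_free,
					       struct pending_req, free_list);
			list_del(&req->free_list);
			return req;
		}
		return alloc_req();	/* slow path: shared global pool */
	}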
>>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
>> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
>> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
>> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
>> >> > All the resources allocated based on xen_blkif_reqs are global in
>> >> > blkback. While (without having measured anything) I think that this
>> >> > is bad from a QoS perspective (not the least implied from a warning
>> >> > issued by Citrix'es multi-page-ring patches:
>> >> >
>> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
>> >> > 		printk(KERN_WARNING "WARNING: "
>> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
>> >> > 		       "consider increasing %s.reqs to >= %ld.",
>> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
>> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
>> >> >
>> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
>> >> > this to per-instance allocations, as the amount of memory taken
>> >> > away from Dom0 for this may be not insignificant when there are
>> >> > many devices.
>> >> >
>> >
>> > What's your main concern on QoS? Lock contention? Starvation? Or any
>> > other things?
>>
>> However you want to put it. Prior to the multi-page ring patches, we
>> have 64 pending requests (global) and 32 ring entries. Obviously,
>> bumping the ring size just to order 1 will already bring the number of
>> possible in-flight entries per device on par with those in-flight across
>> all devices. So _if_ someone really determined that a multi-page ring
>> helps performance, I wonder whether that was with manually
>> adjusted global pending request values (not said anywhere) or with
>> just a single frontend (not very close to real world scenarios).
>
> Just to be precise, bumping the order to 1 makes the number of ring
> entries more than 64.

Iirc order 0 -> 32 entries, order 1 -> 64 entries.

Jan
On Tue, 2012-03-27 at 13:45 +0100, Jan Beulich wrote:
> >>> On 27.03.12 at 14:34, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Tue, 2012-03-27 at 11:22 +0100, Jan Beulich wrote:
> >> >>> On 27.03.12 at 11:41, Wei Liu <wei.liu2@citrix.com> wrote:
> >> > On Mon, 2012-03-26 at 17:20 +0100, Ian Campbell wrote:
> >> >> On Mon, 2012-03-26 at 16:56 +0100, Jan Beulich wrote:
> >> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> >> > blkback. While (without having measured anything) I think that this
> >> >> > is bad from a QoS perspective (not the least implied from a warning
> >> >> > issued by Citrix'es multi-page-ring patches:
> >> >> >
> >> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> >> > 		printk(KERN_WARNING "WARNING: "
> >> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >> >
> >> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> >> > this to per-instance allocations, as the amount of memory taken
> >> >> > away from Dom0 for this may be not insignificant when there are
> >> >> > many devices.
> >> >> >
> >> >
> >> > What's your main concern on QoS? Lock contention? Starvation? Or any
> >> > other things?
> >>
> >> However you want to put it. Prior to the multi-page ring patches, we
> >> have 64 pending requests (global) and 32 ring entries. Obviously,
> >> bumping the ring size just to order 1 will already bring the number of
> >> possible in-flight entries per device on par with those in-flight across
> >> all devices. So _if_ someone really determined that a multi-page ring
> >> helps performance, I wonder whether that was with manually
> >> adjusted global pending request values (not said anywhere) or with
> >> just a single frontend (not very close to real world scenarios).
> >
> > Just to be precise, bumping the order to 1 makes the number of ring
> > entries more than 64.
>
> Iirc order 0 -> 32 entries, order 1 -> 64 entries.

I skimmed the code just to make sure. You're right, the size is rounded
down to a power of 2. What I was talking about is the value before
rounding down.

Wei.

> Jan
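[For reference, the arithmetic behind those numbers: the ring macros
divide the usable bytes by the size of a request/response union and
round the result down to a power of two. Assuming the classic blkif
layout (64-byte shared-ring header, 112-byte union), a quick userspace
check reproduces them.]

	/* Back-of-the-envelope check; the 64-byte header and 112-byte
	 * entry size are assumptions matching the classic blkif ABI. */
	#include <stdio.h>

	static unsigned int ring_entries(unsigned int order)
	{
		unsigned int bytes = (4096u << order) - 64;	/* page(s) minus header */
		unsigned int raw = bytes / 112;			/* value before rounding */
		unsigned int n = 1;

		while (n * 2 <= raw)	/* round DOWN to a power of 2 */
			n *= 2;
		return n;
	}

	int main(void)
	{
		unsigned int order;

		for (order = 0; order <= 2; order++)
			printf("order %u -> %u entries\n", order, ring_entries(order));
		/* order 0 -> 32, order 1 -> 64 (72 before rounding), order 2 -> 128 */
		return 0;
	}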
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, March 27, 2012 8:27 AM
> To: Daniel Stodden
> Cc: Andrei Lifchits; xen-devel
> Subject: Re: [Xen-devel] blkback global resources
>
> >>> On 26.03.12 at 18:53, Daniel Stodden <daniel.stodden@googlemail.com> wrote:
> > On Mon, 2012-03-26 at 17:06 +0100, Keir Fraser wrote:
> >> Cc'ing Daniel for you on this one, Jan.
> >>
> >> K.
> >>
> >> On 26/03/2012 16:56, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>
> >> > All the resources allocated based on xen_blkif_reqs are global in
> >> > blkback. While (without having measured anything) I think that this
> >> > is bad from a QoS perspective (not the least implied from a warning
> >> > issued by Citrix'es multi-page-ring patches:
> >> >
> >> > 	if (blkif_reqs < BLK_RING_SIZE(order))
> >> > 		printk(KERN_WARNING "WARNING: "
> >> > 		       "I/O request space (%d reqs) < ring order %ld, "
> >> > 		       "consider increasing %s.reqs to >= %ld.",
> >> > 		       blkif_reqs, order, KBUILD_MODNAME,
> >> > 		       roundup_pow_of_two(BLK_RING_SIZE(order)));
> >> >
> >> > indicating that this _is_ a bottleneck), I'm otoh hesitant to convert
> >> > this to per-instance allocations, as the amount of memory taken
> >> > away from Dom0 for this may be not insignificant when there are
> >> > many devices.
> >> >
> >> > Does anyone have an opinion here, in particular regarding the
> >> > original authors' decision to make this global vs. the apparently
> >> > made observation (by Daniel Stodden, the author of said patch,
> >> > who I don't have any current email of to ask directly), but also
> >> > in the context of multi-page rings, the purpose of which is to
> >> > allow for larger amounts of in-flight I/O?
> >> >
> >> > Thanks, Jan
> >
> > Re-CC'ing Andrei Lifchits, I think there's been some work going on at
> > Citrix regarding that matter.
> >
> > Yes, just allocating a pfn pool per backend instance is way too much
> > memory ballooned out. Otherwise this stuff would have never looked the
> > way it does now.
>
> This of course could be accounted for by having an initially non-empty
> (large enough) balloon (not sure how easy it is these days to do this
> for pv-ops, but it has always been trivial with the legacy code). That
> wouldn't help a 32-bit kernel much (where generally the initial balloon
> is all in highmem, yet the vacated pages need to be in lowmem), but
> for 64-bit kernels it should be fine.
>
> > Regarding the right balance, note that on the other extreme end, if PFN
> > space were infinite, there's not much expected performance gain from
> > rendering virtual backends fully independent. Beyond controller queue
> > depth, these requests are all just going to pile up, waiting.
>
> Is there a way to look through the queue stack to find out how many
> distinct ones there are that the backend is running on top of, as well
> as - for a particular I/O path - the one with the smallest depth? Or can
> one assume that the top most one (generally loop's or blktap2's) won't
> advertise a queue deeper than what is going to be accepted
> downstream (probably not, I'd guess)?

Hm, I don't remember seeing anything relating to that off the top of my
head in the blkback code, so I don't think so. (I'm not sure the benefit
would be that great, anyways.)

> And - what you say would similarly apply to the usefulness of multi-page
> rings afaict.
>
> > XenServer has some support for decoupling in blktap.ko [1] which worked
> > relatively well: Use frame 'pool' kobjects. A bunch of pages, mapped to
> > sysfs object. Name was arbitrary. Size configurable, even at runtime.

I have added similar functionality to blkback (pools configurable
through xenstore, with userland tools creating one pool per SR), which
is now out in the form of a limited-availability hotfix and will be
there in the next XenServer release. Felipe (CC'd) measured the effects
on performance and found that it helps.

> > Sysfs meant stuff was easily set up by shell or python code, or
> > manually. To become operational, every backend must be bound to a pool
> > (initially, the global 'default' one, for tool compat). Backends can be
> > relinked arbitrarily before entering Connected state.
> >
> > Then let the userland toolstack set things up according to physical I/O
> > topology and properties probed. Basically every physical backend (say, a
> > volume group, or a HBA) would start out by allocating and dimensioning a
> > dedicated pool (named after the backend), and every backend instance
> > fired up gets bound to the pool it belongs to.
>
> Having userland do all that seems like a fallback solution only to me - I
> would hope that sufficient information is available directly to the drivers.

You're probably right.

> Thanks in any case for responding so quickly, Jan
>
> > There's a lot of additional optimizations one could consider, e.g.
> > autogrowing the pool (log(nbackends) or so?) and the like. To improve
> > locality, having backends which look ahead in their request queue and
> > allocate whole batches is probably a good idea too, etc, etc.
> >
> > HTH,
> > Daniel
> >
> > [1]
> > http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git
> > mostly in drivers/block/blktap/sysfs.c (show/store_pool) and request.c.
> > Note that these are based on mempools, not the frame pools blkback
> > would take.

Cheers,
Andrei