Hello,

These are the notes about the block improvements discussed at the Hackathon; some of them, if not all, have already been incorporated into: https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit

Here is a list of future work items, more measurable and limited:

A) Separate request and response rings. This has several benefits: we will be able to reduce the size of the response struct, since it no longer has to be the same size as the request, and we could increase the number of in-flight requests, since we are no longer limited by the size of the request ring. We still need to make sure that all in-flight requests can be written to the response ring once they are finished, or added to a queue that writes them to the response ring when there's a free slot.

B) Clean up the size differences between the 32/64bit structs, and while there reduce the size of a request so it is aligned to a cache line (64 bytes), as sketched below.

C) Investigate the interrupt rate between blkfront/blkback and, if needed, add support for polling; switching between polling and events could be done automatically by blkfront/blkback when a high interrupt rate is detected.

D) Multipage ring support.
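For item B, a rough sketch of a request that packs into a single 64-byte cache line on both 32-bit and 64-bit guests could look like the following (field names, sizes and the segment count are made up for illustration; this is not the current struct blkif_request):

    #include <stdint.h>
    typedef uint32_t grant_ref_t;    /* as in the Xen public headers */

    /* Illustrative sketch only -- not the actual blkif request layout. */
    struct blkif_request_aligned {
        uint8_t     operation;       /* BLKIF_OP_* */
        uint8_t     nr_segments;
        uint16_t    _pad;
        uint32_t    handle;
        uint64_t    id;              /* echoed back in the response */
        uint64_t    sector_number;
        grant_ref_t gref[10];        /* 10 * 4 bytes */
    } __attribute__((__packed__));   /* 1+1+2+4+8+8+40 = 64 bytes */

The point is only that a fixed 64-byte request removes the 32/64bit padding differences and keeps one request per cache line.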
Hello,

While working on further block improvements I've found an issue with persistent grants in blkfront.

Persistent grants basically allocate grants and then they are never released, so both blkfront and blkback keep using the same memory pages for all the transactions.

This is not a problem in blkback, because we can dynamically choose how many grants we want to map. On the other hand, blkfront cannot remove the access to those grants at any point, because blkfront doesn't know if blkback has these grants mapped persistently or not.

So if for example we start expanding the number of segments in indirect requests, to a value like 512 segments per request, blkfront will probably try to persistently map 512*32+512 = 16896 grants per device; that's many more grants than the current default, which is 32*256 = 8192 (if using grant tables v2). This can cause serious problems to other interfaces inside the DomU, since blkfront basically starts hoarding all possible grants, leaving other interfaces completely locked.

I've been thinking about different ways to solve this, but so far I haven't been able to find a nice solution:

1. Limit the number of persistent grants a blkfront instance can use; let's say that only the first X used grants will be persistently mapped by both blkfront and blkback, and if more grants are needed the previous map/unmap will be used.

2. Switch to grant copy in blkback, and get rid of persistent grants (I have not benchmarked this solution, but I'm quite sure it will involve a performance regression, especially when scaling to a high number of domains).

3. Increase the size of the grant_table or the size of a single grant (from 4k to 2M) (this is from Stefano Stabellini).

4. Introduce a new request type that we can use to request blkback to unmap certain grefs so we can free them in blkfront.

So far none of them looks like a suitable solution.
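Spelling out the arithmetic above (the numbers are exactly the ones in the example; this is only an illustration of the mismatch):

    /* Illustration only: figures from the example above. */
    unsigned int per_device_grants = 512 * 32 + 512;  /* = 16896 grants for one disk        */
    unsigned int table_capacity    = 32 * 256;        /* = 8192 v2 entries in 32 grant frames */
    /* A single such disk alone asks for roughly twice the whole
     * domain's grant table. */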
On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote:> Hello, > > While working on further block improvements I''ve found an issue with > persistent grants in blkfront. > > Persistent grants basically allocate grants and then they are never > released, so both blkfront and blkback keep using the same memory pages > for all the transactions. > > This is not a problem in blkback, because we can dynamically choose how > many grants we want to map. On the other hand, blkfront cannot remove > the access to those grants at any point, because blkfront doesn''t know > if blkback has this grants mapped persistently or not. > > So if for example we start expanding the number of segments in indirect > requests, to a value like 512 segments per requests, blkfront will > probably try to persistently map 512*32+512 = 16896 grants per device, > that''s much more grants that the current default, which is 32*256 = 8192 > (if using grant tables v2). This can cause serious problems to other > interfaces inside the DomU, since blkfront basically starts hoarding all > possible grants, leaving other interfaces completely locked.Yikes.> I''ve been thinking about different ways to solve this, but so far I > haven''t been able to found a nice solution: > > 1. Limit the number of persistent grants a blkfront instance can use, > let''s say that only the first X used grants will be persistently mapped > by both blkfront and blkback, and if more grants are needed the previous > map/unmap will be used.I''m not thrilled with this option. It would likely introduce some significant performance variability, wouldn''t it?> 2. Switch to grant copy in blkback, and get rid of persistent grants (I > have not benchmarked this solution, but I''m quite sure it will involve a > performance regression, specially when scaling to a high number of domains).Why do you think so?> 3. Increase the size of the grant_table or the size of a single grant > (from 4k to 2M) (this is from Stefano Stabellini).Seems like a bit of a bigger hammer approach.> 4. Introduce a new request type that we can use to request blkback to > unmap certain grefs so we can free them in blkfront.Sounds complex.> So far none of them looks like a suitable solution.I agree. Of these, I think #2 is worth a little more attention. --msw
On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote:> Hello, > > While working on further block improvements I''ve found an issue with > persistent grants in blkfront. > > Persistent grants basically allocate grants and then they are never > released, so both blkfront and blkback keep using the same memory pages > for all the transactions. > > This is not a problem in blkback, because we can dynamically choose how > many grants we want to map. On the other hand, blkfront cannot remove > the access to those grants at any point, because blkfront doesn''t know > if blkback has this grants mapped persistently or not. > > So if for example we start expanding the number of segments in indirect > requests, to a value like 512 segments per requests, blkfront will > probably try to persistently map 512*32+512 = 16896 grants per device, > that''s much more grants that the current default, which is 32*256 = 8192 > (if using grant tables v2). This can cause serious problems to other > interfaces inside the DomU, since blkfront basically starts hoarding all > possible grants, leaving other interfaces completely locked. > > I''ve been thinking about different ways to solve this, but so far I > haven''t been able to found a nice solution: > > 1. Limit the number of persistent grants a blkfront instance can use, > let''s say that only the first X used grants will be persistently mapped > by both blkfront and blkback, and if more grants are needed the previous > map/unmap will be used. > > 2. Switch to grant copy in blkback, and get rid of persistent grants (I > have not benchmarked this solution, but I''m quite sure it will involve a > performance regression, specially when scaling to a high number of domains). > > 3. Increase the size of the grant_table or the size of a single grant > (from 4k to 2M) (this is from Stefano Stabellini). > > 4. Introduce a new request type that we can use to request blkback to > unmap certain grefs so we can free them in blkfront.5). Lift the limit of grant pages a domain can have. 6). Have an outstanding of grant pools that are mapped to a guest and recycle them? That way both netfront and blkfront could use them as needed?> > So far none of them looks like a suitable solution. >
On Fri, Jun 21, 2013 at 04:16:25PM -0400, Konrad Rzeszutek Wilk wrote:> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > > Hello, > > > > While working on further block improvements I''ve found an issue with > > persistent grants in blkfront. > > > > Persistent grants basically allocate grants and then they are never > > released, so both blkfront and blkback keep using the same memory pages > > for all the transactions. > > > > This is not a problem in blkback, because we can dynamically choose how > > many grants we want to map. On the other hand, blkfront cannot remove > > the access to those grants at any point, because blkfront doesn''t know > > if blkback has this grants mapped persistently or not. > > > > So if for example we start expanding the number of segments in indirect > > requests, to a value like 512 segments per requests, blkfront will > > probably try to persistently map 512*32+512 = 16896 grants per device, > > that''s much more grants that the current default, which is 32*256 = 8192 > > (if using grant tables v2). This can cause serious problems to other > > interfaces inside the DomU, since blkfront basically starts hoarding all > > possible grants, leaving other interfaces completely locked. > > > > I''ve been thinking about different ways to solve this, but so far I > > haven''t been able to found a nice solution: > > > > 1. Limit the number of persistent grants a blkfront instance can use, > > let''s say that only the first X used grants will be persistently mapped > > by both blkfront and blkback, and if more grants are needed the previous > > map/unmap will be used. > > > > 2. Switch to grant copy in blkback, and get rid of persistent grants (I > > have not benchmarked this solution, but I''m quite sure it will involve a > > performance regression, specially when scaling to a high number of domains). > >Any chance that the speed of copying is fast enough for block devices?> > 3. Increase the size of the grant_table or the size of a single grant > > (from 4k to 2M) (this is from Stefano Stabellini). > > > > 4. Introduce a new request type that we can use to request blkback to > > unmap certain grefs so we can free them in blkfront. > > > 5). Lift the limit of grant pages a domain can have.If I''m not mistaken, this is basically the same as "increase the size of the grant_table" in #3.> > 6). Have an outstanding of grant pools that are mapped to a guest and > recycle them? That way both netfront and blkfront could use them as needed? >Is there an easy way to instrument the network stack to use those pages only? Wei.> > > > So far none of them looks like a suitable solution. > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On 21/06/13 20:07, Matt Wilson wrote:> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >> Hello, >> >> While working on further block improvements I''ve found an issue with >> persistent grants in blkfront. >> >> Persistent grants basically allocate grants and then they are never >> released, so both blkfront and blkback keep using the same memory pages >> for all the transactions. >> >> This is not a problem in blkback, because we can dynamically choose how >> many grants we want to map. On the other hand, blkfront cannot remove >> the access to those grants at any point, because blkfront doesn''t know >> if blkback has this grants mapped persistently or not. >> >> So if for example we start expanding the number of segments in indirect >> requests, to a value like 512 segments per requests, blkfront will >> probably try to persistently map 512*32+512 = 16896 grants per device, >> that''s much more grants that the current default, which is 32*256 = 8192 >> (if using grant tables v2). This can cause serious problems to other >> interfaces inside the DomU, since blkfront basically starts hoarding all >> possible grants, leaving other interfaces completely locked. > > Yikes. > >> I''ve been thinking about different ways to solve this, but so far I >> haven''t been able to found a nice solution: >> >> 1. Limit the number of persistent grants a blkfront instance can use, >> let''s say that only the first X used grants will be persistently mapped >> by both blkfront and blkback, and if more grants are needed the previous >> map/unmap will be used. > > I''m not thrilled with this option. It would likely introduce some > significant performance variability, wouldn''t it?Probably, and also it will be hard to distribute the number of available grant across the different interfaces in a performance sensible way, specially given the fact that once a grant is assigned to a interface it cannot be returned back to the pool of grants. So if we had two interfaces with very different usage (one very busy and another one almost idle), and equally distribute the grants amongst them, one will have a lot of unused grants while the other will suffer from starvation.> >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I >> have not benchmarked this solution, but I''m quite sure it will involve a >> performance regression, specially when scaling to a high number of domains). > > Why do you think so?First because grant_copy is done by the hypervisor, while when using persistent grants the copy is done by the guest. Also, grant_copy takes the grant lock, so when scaling to a large number of domains there''s going to be contention around this lock. Persistent grants don''t need any shared lock, and thus scale better.> >> 3. Increase the size of the grant_table or the size of a single grant >> (from 4k to 2M) (this is from Stefano Stabellini). > > Seems like a bit of a bigger hammer approach. > >> 4. Introduce a new request type that we can use to request blkback to >> unmap certain grefs so we can free them in blkfront. > > Sounds complex. > >> So far none of them looks like a suitable solution. > > I agree. Of these, I think #2 is worth a little more attention. > > --msw >
On 21/06/13 22:16, Konrad Rzeszutek Wilk wrote:> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >> Hello, >> >> While working on further block improvements I''ve found an issue with >> persistent grants in blkfront. >> >> Persistent grants basically allocate grants and then they are never >> released, so both blkfront and blkback keep using the same memory pages >> for all the transactions. >> >> This is not a problem in blkback, because we can dynamically choose how >> many grants we want to map. On the other hand, blkfront cannot remove >> the access to those grants at any point, because blkfront doesn''t know >> if blkback has this grants mapped persistently or not. >> >> So if for example we start expanding the number of segments in indirect >> requests, to a value like 512 segments per requests, blkfront will >> probably try to persistently map 512*32+512 = 16896 grants per device, >> that''s much more grants that the current default, which is 32*256 = 8192 >> (if using grant tables v2). This can cause serious problems to other >> interfaces inside the DomU, since blkfront basically starts hoarding all >> possible grants, leaving other interfaces completely locked. >> >> I''ve been thinking about different ways to solve this, but so far I >> haven''t been able to found a nice solution: >> >> 1. Limit the number of persistent grants a blkfront instance can use, >> let''s say that only the first X used grants will be persistently mapped >> by both blkfront and blkback, and if more grants are needed the previous >> map/unmap will be used. >> >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I >> have not benchmarked this solution, but I''m quite sure it will involve a >> performance regression, specially when scaling to a high number of domains). >> >> 3. Increase the size of the grant_table or the size of a single grant >> (from 4k to 2M) (this is from Stefano Stabellini). >> >> 4. Introduce a new request type that we can use to request blkback to >> unmap certain grefs so we can free them in blkfront. > > > 5). Lift the limit of grant pages a domain can have. > > 6). Have an outstanding of grant pools that are mapped to a guest and > recycle them? That way both netfront and blkfront could use them as needed?If all the backends run in the same guest that could be a viable option, but if we have backends running in different domains we will end up with several different pools for each backend domain, and thus the scenario is going to be quite similar to what we have now (a pool can hoard all available grants and leave the others starving).
On Sat, 22 Jun 2013, Wei Liu wrote:> On Fri, Jun 21, 2013 at 04:16:25PM -0400, Konrad Rzeszutek Wilk wrote: > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > > > Hello, > > > > > > While working on further block improvements I''ve found an issue with > > > persistent grants in blkfront. > > > > > > Persistent grants basically allocate grants and then they are never > > > released, so both blkfront and blkback keep using the same memory pages > > > for all the transactions. > > > > > > This is not a problem in blkback, because we can dynamically choose how > > > many grants we want to map. On the other hand, blkfront cannot remove > > > the access to those grants at any point, because blkfront doesn''t know > > > if blkback has this grants mapped persistently or not. > > > > > > So if for example we start expanding the number of segments in indirect > > > requests, to a value like 512 segments per requests, blkfront will > > > probably try to persistently map 512*32+512 = 16896 grants per device, > > > that''s much more grants that the current default, which is 32*256 = 8192 > > > (if using grant tables v2). This can cause serious problems to other > > > interfaces inside the DomU, since blkfront basically starts hoarding all > > > possible grants, leaving other interfaces completely locked. > > > > > > I''ve been thinking about different ways to solve this, but so far I > > > haven''t been able to found a nice solution: > > > > > > 1. Limit the number of persistent grants a blkfront instance can use, > > > let''s say that only the first X used grants will be persistently mapped > > > by both blkfront and blkback, and if more grants are needed the previous > > > map/unmap will be used. > > > > > > 2. Switch to grant copy in blkback, and get rid of persistent grants (I > > > have not benchmarked this solution, but I''m quite sure it will involve a > > > performance regression, specially when scaling to a high number of domains). > > > > > Any chance that the speed of copying is fast enough for block devices? > > > > 3. Increase the size of the grant_table or the size of a single grant > > > (from 4k to 2M) (this is from Stefano Stabellini). > > > > > > 4. Introduce a new request type that we can use to request blkback to > > > unmap certain grefs so we can free them in blkfront. > > > > > > 5). Lift the limit of grant pages a domain can have. > > If I''m not mistaken, this is basically the same as "increase the size of > the grant_table" in #3.Yes, that was one of the things I was suggesting, but it needs investigating: I wouldn''t want that increasing the number of grant frames would reach a different scalability limit of the data structure. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On Sat, Jun 22, 2013 at 09:11:20AM +0200, Roger Pau Monné wrote:
> On 21/06/13 20:07, Matt Wilson wrote:
> > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote:

[...]

> >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I
> >> have not benchmarked this solution, but I'm quite sure it will involve a
> >> performance regression, especially when scaling to a high number of domains).
> >
> > Why do you think so?
>
> First because grant_copy is done by the hypervisor, while when using
> persistent grants the copy is done by the guest. Also, grant_copy takes
> the grant lock, so when scaling to a large number of domains there's
> going to be contention around this lock. Persistent grants don't need
> any shared lock, and thus scale better.

It'd benefit xen-netback to make the locking in the copy path more fine grained. That would help multi-vif domUs today, and multi-queue vifs later on.

Thoughts?

--msw
On Mon, Jun 24, 2013 at 11:09:19PM -0700, Matt Wilson wrote:> On Sat, Jun 22, 2013 at 09:11:20AM +0200, Roger Pau Monné wrote: > > On 21/06/13 20:07, Matt Wilson wrote: > > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > > [...] > > > >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I > > >> have not benchmarked this solution, but I''m quite sure it will involve a > > >> performance regression, specially when scaling to a high number of domains). > > > > > > Why do you think so? > > > > First because grant_copy is done by the hypervisor, while when using > > persistent grants the copy is done by the guest. Also, grant_copy takes > > the grant lock, so when scaling to a large number of domains there''s > > going to be contention around this lock. Persistent grants don''t need > > any shared lock, and thus scale better. > > It''d benefit xen-netback to make the locking in the copy path more > fine grained. That would help multi-vif domUs today, and multi-queue > vifs later on. >I''m not sure I follow. I presume you mean using persistent grant in xen-netback to help scale better? Wei.> Thoughts? > > --msw > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Tue, Jun 25, 2013 at 02:01:30PM +0100, Wei Liu wrote:> On Mon, Jun 24, 2013 at 11:09:19PM -0700, Matt Wilson wrote: > > On Sat, Jun 22, 2013 at 09:11:20AM +0200, Roger Pau Monné wrote: > > > On 21/06/13 20:07, Matt Wilson wrote: > > > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > > > > [...] > > > > > >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I > > > >> have not benchmarked this solution, but I''m quite sure it will involve a > > > >> performance regression, specially when scaling to a high number of domains). > > > > > > > > Why do you think so? > > > > > > First because grant_copy is done by the hypervisor, while when using > > > persistent grants the copy is done by the guest. Also, grant_copy takes > > > the grant lock, so when scaling to a large number of domains there''s > > > going to be contention around this lock. Persistent grants don''t need > > > any shared lock, and thus scale better. > > > > It''d benefit xen-netback to make the locking in the copy path more > > fine grained. That would help multi-vif domUs today, and multi-queue > > vifs later on. > > > > I''m not sure I follow. I presume you mean using persistent grant in > xen-netback to help scale better?No, I mean further scaling improvements in the GNTTABOP_copy path would benefit xen-netback performance when a single guest has multiple vifs, and will be needed for good multi-queue performance. Given we might need to do some work there, would it make sense to change blkback to use GNTTABOP_copy to avoid the problem he''s identified with persistent grants. --msw
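For what it's worth, a minimal sketch of what a grant-copy based completion in blkback could look like is below. This is not actual xen-blkback code; the helper name copy_response_to_guest() and the gfn conversion helper are assumptions, the point is only the GNTTABOP_copy / gnttab_batch_copy usage being discussed:

    #include <xen/grant_table.h>
    #include <xen/interface/grant_table.h>

    /* Illustrative only: copy 'len' bytes of completed I/O back into the
     * frontend's granted page instead of keeping it persistently mapped. */
    static int copy_response_to_guest(domid_t domid, grant_ref_t gref,
                                      void *data, unsigned int len)
    {
        struct gnttab_copy op = {
            .flags         = GNTCOPY_dest_gref,   /* destination is a grant ref */
            .source.domid  = DOMID_SELF,
            .source.u.gmfn = virt_to_gfn(data),   /* helper name approximate */
            .source.offset = offset_in_page(data),
            .dest.domid    = domid,
            .dest.u.ref    = gref,
            .dest.offset   = 0,
            .len           = len,
        };

        gnttab_batch_copy(&op, 1);                /* issues GNTTABOP_copy */
        return op.status == GNTST_okay ? 0 : -EIO;
    }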
On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:> On 21/06/13 20:07, Matt Wilson wrote: > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > >> Hello, > >> > >> While working on further block improvements I've found an issue with > >> persistent grants in blkfront. > >> > >> Persistent grants basically allocate grants and then they are never > >> released, so both blkfront and blkback keep using the same memory pages > >> for all the transactions. > >> > >> This is not a problem in blkback, because we can dynamically choose how > >> many grants we want to map. On the other hand, blkfront cannot remove > >> the access to those grants at any point, because blkfront doesn't know > >> if blkback has this grants mapped persistently or not. > >> > >> So if for example we start expanding the number of segments in indirect > >> requests, to a value like 512 segments per requests, blkfront will > >> probably try to persistently map 512*32+512 = 16896 grants per device, > >> that's much more grants that the current default, which is 32*256 = 8192 > >> (if using grant tables v2). This can cause serious problems to other > >> interfaces inside the DomU, since blkfront basically starts hoarding all > >> possible grants, leaving other interfaces completely locked. > > > > Yikes. > > > >> I've been thinking about different ways to solve this, but so far I > >> haven't been able to found a nice solution: > >> > >> 1. Limit the number of persistent grants a blkfront instance can use, > >> let's say that only the first X used grants will be persistently mapped > >> by both blkfront and blkback, and if more grants are needed the previous > >> map/unmap will be used. > > > > I'm not thrilled with this option. It would likely introduce some > > significant performance variability, wouldn't it? > > Probably, and also it will be hard to distribute the number of available > grant across the different interfaces in a performance sensible way, > specially given the fact that once a grant is assigned to a interface it > cannot be returned back to the pool of grants. > > So if we had two interfaces with very different usage (one very busy and > another one almost idle), and equally distribute the grants amongst > them, one will have a lot of unused grants while the other will suffer > from starvation.I do think we need to implement some sort of reclaim scheme, which probably does mean a specific request (per your #4). We simply can't have a device which once upon a time had high throughput but is no mostly ideal continue to tie up all those grants. If you make the reuse of grants use an MRU scheme and reclaim the currently unused tail fairly infrequently and in large batches then the perf overhead should be minimal, I think. I also don't think I would discount the idea of using ephemeral grants to cover bursts so easily either, in fact it might fall out quite naturally from an MRU scheme? In that scheme bursting up is pretty cheap since grant map is relative inexpensive, and recovering from the burst shouldn't be too expensive if you batch it. If it turns out to be not a burst but a sustained level of I/O then the MRU scheme would mean you wouldn't be recovering them. I also think there probably needs to be some tunable per device limit on the maximum persistent grants, perhaps minimum and maximum pool sizes ties in with an MRU scheme? If nothing else it gives the admin the ability to prioritise devices. Ian. 
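To make the MRU/reclaim idea concrete, here is a purely hypothetical sketch; none of these structures exist in blkfront today, and the "unmap gref" request it would send to blkback is the new request type from option 4:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/jiffies.h>
    #include <xen/grant_table.h>

    /* Hypothetical: persistent grants on an LRU list (most recently used
     * at the head); a periodic worker reclaims idle ones from the tail in
     * batches, so the common case never falls back to map/unmap. */
    struct persistent_gnt {
        grant_ref_t      gref;
        struct page     *page;
        unsigned long    last_used;         /* jiffies of last use */
        struct list_head node;
    };

    #define RECLAIM_BATCH     64            /* reclaim in large-ish batches   */
    #define RECLAIM_IDLE_AGE  (10 * HZ)     /* "a while": tunable idle period */

    static void reclaim_idle_grants(struct list_head *lru, spinlock_t *lock)
    {
        struct persistent_gnt *gnt, *tmp;
        unsigned int n = 0;

        spin_lock(lock);
        list_for_each_entry_safe_reverse(gnt, tmp, lru, node) {
            if (n == RECLAIM_BATCH ||
                time_before(jiffies, gnt->last_used + RECLAIM_IDLE_AGE))
                break;                      /* rest of the list is still hot */
            list_del(&gnt->node);
            /* Here blkfront would queue an "unmap this gref" request to
             * blkback (option 4 above) and only then end foreign access. */
            n++;
        }
        spin_unlock(lock);
    }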
On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:
> First because grant_copy is done by the hypervisor, while when using
> persistent grants the copy is done by the guest.

This is true and a reasonable concern.

> Also, grant_copy takes the grant lock, so when scaling to a large
> number of domains there's going to be contention around this lock.

Does grant copy really take the lock for the duration of the copy, preventing any other grant ops from the source and/or target domain?

If true then that sounds like an area which is ripe for optimisation!

However I am hopeful that you are mistaken... __acquire_grant_for_copy() takes the grant lock while it pins the entry into the active grant entry list and not for the actual duration of the copy (and likewise __release_grant_for_copy()). I hope Jan can confirm this!

Ian.
>>> On 25.06.13 at 17:57, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:
>> Also, grant_copy takes the grant lock, so when scaling to a large
>> number of domains there's going to be contention around this lock.
>
> Does grant copy really take the lock for the duration of the copy,
> preventing any other grant ops from the source and/or target domain?
>
> If true then that sounds like an area which is ripe for optimisation!
>
> However I am hopeful that you are mistaken... __acquire_grant_for_copy()
> takes the grant lock while it pins the entry into the active grant entry
> list and not for the actual duration of the copy (and likewise
> __release_grant_for_copy()). I hope Jan can confirm this!

Yes, that's how I recall it works since the removal of the per-domain lock uses from those paths.

Jan
On 25/06/13 17:57, Ian Campbell wrote:
> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote:
>> First because grant_copy is done by the hypervisor, while when using
>> persistent grants the copy is done by the guest.
>
> This is true and a reasonable concern.
>
>> Also, grant_copy takes the grant lock, so when scaling to a large
>> number of domains there's going to be contention around this lock.
>
> Does grant copy really take the lock for the duration of the copy,
> preventing any other grant ops from the source and/or target domain?
>
> If true then that sounds like an area which is ripe for optimisation!
>
> However I am hopeful that you are mistaken... __acquire_grant_for_copy()
> takes the grant lock while it pins the entry into the active grant entry
> list and not for the actual duration of the copy (and likewise
> __release_grant_for_copy()). I hope Jan can confirm this!

Sorry, I probably wasn't detailed enough here. I didn't mean that it takes the lock for the duration of the whole copy, but it is used at several places during the grant copy operation, so it might introduce contention when the number of domains is high (although I have not measured it).
On Tue, 25 Jun 2013, Ian Campbell wrote:> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: > > On 21/06/13 20:07, Matt Wilson wrote: > > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > > >> Hello, > > >> > > >> While working on further block improvements I''ve found an issue with > > >> persistent grants in blkfront. > > >> > > >> Persistent grants basically allocate grants and then they are never > > >> released, so both blkfront and blkback keep using the same memory pages > > >> for all the transactions. > > >> > > >> This is not a problem in blkback, because we can dynamically choose how > > >> many grants we want to map. On the other hand, blkfront cannot remove > > >> the access to those grants at any point, because blkfront doesn''t know > > >> if blkback has this grants mapped persistently or not. > > >> > > >> So if for example we start expanding the number of segments in indirect > > >> requests, to a value like 512 segments per requests, blkfront will > > >> probably try to persistently map 512*32+512 = 16896 grants per device, > > >> that''s much more grants that the current default, which is 32*256 = 8192 > > >> (if using grant tables v2). This can cause serious problems to other > > >> interfaces inside the DomU, since blkfront basically starts hoarding all > > >> possible grants, leaving other interfaces completely locked. > > > > > > Yikes. > > > > > >> I''ve been thinking about different ways to solve this, but so far I > > >> haven''t been able to found a nice solution: > > >> > > >> 1. Limit the number of persistent grants a blkfront instance can use, > > >> let''s say that only the first X used grants will be persistently mapped > > >> by both blkfront and blkback, and if more grants are needed the previous > > >> map/unmap will be used. > > > > > > I''m not thrilled with this option. It would likely introduce some > > > significant performance variability, wouldn''t it? > > > > Probably, and also it will be hard to distribute the number of available > > grant across the different interfaces in a performance sensible way, > > specially given the fact that once a grant is assigned to a interface it > > cannot be returned back to the pool of grants. > > > > So if we had two interfaces with very different usage (one very busy and > > another one almost idle), and equally distribute the grants amongst > > them, one will have a lot of unused grants while the other will suffer > > from starvation. > > I do think we need to implement some sort of reclaim scheme, which > probably does mean a specific request (per your #4). We simply can''t > have a device which once upon a time had high throughput but is no > mostly ideal continue to tie up all those grants. > > If you make the reuse of grants use an MRU scheme and reclaim the > currently unused tail fairly infrequently and in large batches then the > perf overhead should be minimal, I think. > > I also don''t think I would discount the idea of using ephemeral grants > to cover bursts so easily either, in fact it might fall out quite > naturally from an MRU scheme? In that scheme bursting up is pretty cheap > since grant map is relative inexpensive, and recovering from the burst > shouldn''t be too expensive if you batch it. If it turns out to be not a > burst but a sustained level of I/O then the MRU scheme would mean you > wouldn''t be recovering them. 
>
> I also think there probably needs to be some tunable per device limit on
> the maximum persistent grants, perhaps minimum and maximum pool sizes
> tied in with an MRU scheme? If nothing else it gives the admin the
> ability to prioritise devices.

If we introduce a reclaim call we have to be careful not to fall back to a map/unmap scheme like we had before.

The way I see it, either these additional grants are useful or not. In the first case we could just limit the maximum amount of persistent grants and be done with it. If they are not useful (they have been allocated for one very large request and not used much after that), could we find a way to identify unusually large requests and avoid using persistent grants for those?
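A trivial illustration of that last idea (the threshold value and the helper are made up; nothing like this exists in blkfront today):

    #include <linux/types.h>

    /* Hypothetical: skip persistent grants for unusually large requests. */
    #define PERSISTENT_SEG_THRESHOLD 32   /* e.g. the old non-indirect maximum */

    static bool use_persistent_grants(unsigned int nr_segments)
    {
        return nr_segments <= PERSISTENT_SEG_THRESHOLD;
    }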
On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:> On Tue, 25 Jun 2013, Ian Campbell wrote: >> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: >> > On 21/06/13 20:07, Matt Wilson wrote: >> > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >> > >> Hello, >> > >> >> > >> While working on further block improvements I''ve found an issue with >> > >> persistent grants in blkfront. >> > >> >> > >> Persistent grants basically allocate grants and then they are never >> > >> released, so both blkfront and blkback keep using the same memory pages >> > >> for all the transactions. >> > >> >> > >> This is not a problem in blkback, because we can dynamically choose how >> > >> many grants we want to map. On the other hand, blkfront cannot remove >> > >> the access to those grants at any point, because blkfront doesn''t know >> > >> if blkback has this grants mapped persistently or not. >> > >> >> > >> So if for example we start expanding the number of segments in indirect >> > >> requests, to a value like 512 segments per requests, blkfront will >> > >> probably try to persistently map 512*32+512 = 16896 grants per device, >> > >> that''s much more grants that the current default, which is 32*256 = 8192 >> > >> (if using grant tables v2). This can cause serious problems to other >> > >> interfaces inside the DomU, since blkfront basically starts hoarding all >> > >> possible grants, leaving other interfaces completely locked. >> > > >> > > Yikes. >> > > >> > >> I''ve been thinking about different ways to solve this, but so far I >> > >> haven''t been able to found a nice solution: >> > >> >> > >> 1. Limit the number of persistent grants a blkfront instance can use, >> > >> let''s say that only the first X used grants will be persistently mapped >> > >> by both blkfront and blkback, and if more grants are needed the previous >> > >> map/unmap will be used. >> > > >> > > I''m not thrilled with this option. It would likely introduce some >> > > significant performance variability, wouldn''t it? >> > >> > Probably, and also it will be hard to distribute the number of available >> > grant across the different interfaces in a performance sensible way, >> > specially given the fact that once a grant is assigned to a interface it >> > cannot be returned back to the pool of grants. >> > >> > So if we had two interfaces with very different usage (one very busy and >> > another one almost idle), and equally distribute the grants amongst >> > them, one will have a lot of unused grants while the other will suffer >> > from starvation. >> >> I do think we need to implement some sort of reclaim scheme, which >> probably does mean a specific request (per your #4). We simply can''t >> have a device which once upon a time had high throughput but is no >> mostly ideal continue to tie up all those grants. >> >> If you make the reuse of grants use an MRU scheme and reclaim the >> currently unused tail fairly infrequently and in large batches then the >> perf overhead should be minimal, I think. >> >> I also don''t think I would discount the idea of using ephemeral grants >> to cover bursts so easily either, in fact it might fall out quite >> naturally from an MRU scheme? In that scheme bursting up is pretty cheap >> since grant map is relative inexpensive, and recovering from the burst >> shouldn''t be too expensive if you batch it. 
If it turns out to be not a >> burst but a sustained level of I/O then the MRU scheme would mean you >> wouldn''t be recovering them. >> >> I also think there probably needs to be some tunable per device limit on >> the maximum persistent grants, perhaps minimum and maximum pool sizes >> ties in with an MRU scheme? If nothing else it gives the admin the >> ability to prioritise devices. > > If we introduce a reclaim call we have to be careful not to fall back > to a map/unmap scheme like we had before. > > The way I see it either these additional grants are useful or not. > In the first case we could just limit the maximum amount of persistent > grants and be done with it. > If they are not useful (they have been allocated for one very large > request and not used much after that), could we find a way to identify > unusually large requests and avoid using persistent grants for those?Isn''t it possible that these grants are useful for some periods of time, but not for others? You wouldn''t say, "Caching the disk data in main memory is either useful or not; if it is not useful (if it was allocated for one very large request and not used much after that), we should find a way to identify unusually large requests and avoid caching it." If you''re playing a movie, sure; but in most cases, the cache was useful for a time, then stopped being useful. Treating the persistent grants the same way makes sense to me. -George
On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote:> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini > <stefano.stabellini@eu.citrix.com> wrote: > > On Tue, 25 Jun 2013, Ian Campbell wrote: > >> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: > >> > On 21/06/13 20:07, Matt Wilson wrote: > >> > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > >> > >> Hello, > >> > >> > >> > >> While working on further block improvements I've found an issue with > >> > >> persistent grants in blkfront. > >> > >> > >> > >> Persistent grants basically allocate grants and then they are never > >> > >> released, so both blkfront and blkback keep using the same memory pages > >> > >> for all the transactions. > >> > >> > >> > >> This is not a problem in blkback, because we can dynamically choose how > >> > >> many grants we want to map. On the other hand, blkfront cannot remove > >> > >> the access to those grants at any point, because blkfront doesn't know > >> > >> if blkback has this grants mapped persistently or not. > >> > >> > >> > >> So if for example we start expanding the number of segments in indirect > >> > >> requests, to a value like 512 segments per requests, blkfront will > >> > >> probably try to persistently map 512*32+512 = 16896 grants per device, > >> > >> that's much more grants that the current default, which is 32*256 = 8192 > >> > >> (if using grant tables v2). This can cause serious problems to other > >> > >> interfaces inside the DomU, since blkfront basically starts hoarding all > >> > >> possible grants, leaving other interfaces completely locked. > >> > > > >> > > Yikes. > >> > > > >> > >> I've been thinking about different ways to solve this, but so far I > >> > >> haven't been able to found a nice solution: > >> > >> > >> > >> 1. Limit the number of persistent grants a blkfront instance can use, > >> > >> let's say that only the first X used grants will be persistently mapped > >> > >> by both blkfront and blkback, and if more grants are needed the previous > >> > >> map/unmap will be used. > >> > > > >> > > I'm not thrilled with this option. It would likely introduce some > >> > > significant performance variability, wouldn't it? > >> > > >> > Probably, and also it will be hard to distribute the number of available > >> > grant across the different interfaces in a performance sensible way, > >> > specially given the fact that once a grant is assigned to a interface it > >> > cannot be returned back to the pool of grants. > >> > > >> > So if we had two interfaces with very different usage (one very busy and > >> > another one almost idle), and equally distribute the grants amongst > >> > them, one will have a lot of unused grants while the other will suffer > >> > from starvation. > >> > >> I do think we need to implement some sort of reclaim scheme, which > >> probably does mean a specific request (per your #4). We simply can't > >> have a device which once upon a time had high throughput but is no > >> mostly ideal continue to tie up all those grants. > >> > >> If you make the reuse of grants use an MRU scheme and reclaim the > >> currently unused tail fairly infrequently and in large batches then the > >> perf overhead should be minimal, I think. > >> > >> I also don't think I would discount the idea of using ephemeral grants > >> to cover bursts so easily either, in fact it might fall out quite > >> naturally from an MRU scheme? 
In that scheme bursting up is pretty cheap > >> since grant map is relative inexpensive, and recovering from the burst > >> shouldn't be too expensive if you batch it. If it turns out to be not a > >> burst but a sustained level of I/O then the MRU scheme would mean you > >> wouldn't be recovering them. > >> > >> I also think there probably needs to be some tunable per device limit on > >> the maximum persistent grants, perhaps minimum and maximum pool sizes > >> ties in with an MRU scheme? If nothing else it gives the admin the > >> ability to prioritise devices. > > > > If we introduce a reclaim call we have to be careful not to fall back > > to a map/unmap scheme like we had before. > > > > The way I see it either these additional grants are useful or not. > > In the first case we could just limit the maximum amount of persistent > > grants and be done with it. > > If they are not useful (they have been allocated for one very large > > request and not used much after that), could we find a way to identify > > unusually large requests and avoid using persistent grants for those? > > Isn't it possible that these grants are useful for some periods of > time, but not for others? You wouldn't say, "Caching the disk data in > main memory is either useful or not; if it is not useful (if it was > allocated for one very large request and not used much after that), we > should find a way to identify unusually large requests and avoid > caching it." If you're playing a movie, sure; but in most cases, the > cache was useful for a time, then stopped being useful. Treating the > persistent grants the same way makes sense to me.Right, this is what I was trying to suggest with the MRU scheme. If you are using lots of grants and you keep on reusing them then they remain persistent and don't get reclaimed. If you are not reusing them for a while then they get reclaimed. If you make "for a while" big enough then you should find you aren't unintentionally falling back to a map/unmap scheme. Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On 26/06/13 12:37, Ian Campbell wrote:> On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote: >> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini >> <stefano.stabellini@eu.citrix.com> wrote: >>> On Tue, 25 Jun 2013, Ian Campbell wrote: >>>> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: >>>>> On 21/06/13 20:07, Matt Wilson wrote: >>>>>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >>>>>>> Hello, >>>>>>> >>>>>>> While working on further block improvements I've found an issue with >>>>>>> persistent grants in blkfront. >>>>>>> >>>>>>> Persistent grants basically allocate grants and then they are never >>>>>>> released, so both blkfront and blkback keep using the same memory pages >>>>>>> for all the transactions. >>>>>>> >>>>>>> This is not a problem in blkback, because we can dynamically choose how >>>>>>> many grants we want to map. On the other hand, blkfront cannot remove >>>>>>> the access to those grants at any point, because blkfront doesn't know >>>>>>> if blkback has this grants mapped persistently or not. >>>>>>> >>>>>>> So if for example we start expanding the number of segments in indirect >>>>>>> requests, to a value like 512 segments per requests, blkfront will >>>>>>> probably try to persistently map 512*32+512 = 16896 grants per device, >>>>>>> that's much more grants that the current default, which is 32*256 = 8192 >>>>>>> (if using grant tables v2). This can cause serious problems to other >>>>>>> interfaces inside the DomU, since blkfront basically starts hoarding all >>>>>>> possible grants, leaving other interfaces completely locked. >>>>>> Yikes. >>>>>> >>>>>>> I've been thinking about different ways to solve this, but so far I >>>>>>> haven't been able to found a nice solution: >>>>>>> >>>>>>> 1. Limit the number of persistent grants a blkfront instance can use, >>>>>>> let's say that only the first X used grants will be persistently mapped >>>>>>> by both blkfront and blkback, and if more grants are needed the previous >>>>>>> map/unmap will be used. >>>>>> I'm not thrilled with this option. It would likely introduce some >>>>>> significant performance variability, wouldn't it? >>>>> Probably, and also it will be hard to distribute the number of available >>>>> grant across the different interfaces in a performance sensible way, >>>>> specially given the fact that once a grant is assigned to a interface it >>>>> cannot be returned back to the pool of grants. >>>>> >>>>> So if we had two interfaces with very different usage (one very busy and >>>>> another one almost idle), and equally distribute the grants amongst >>>>> them, one will have a lot of unused grants while the other will suffer >>>>> from starvation. >>>> I do think we need to implement some sort of reclaim scheme, which >>>> probably does mean a specific request (per your #4). We simply can't >>>> have a device which once upon a time had high throughput but is no >>>> mostly ideal continue to tie up all those grants. >>>> >>>> If you make the reuse of grants use an MRU scheme and reclaim the >>>> currently unused tail fairly infrequently and in large batches then the >>>> perf overhead should be minimal, I think. >>>> >>>> I also don't think I would discount the idea of using ephemeral grants >>>> to cover bursts so easily either, in fact it might fall out quite >>>> naturally from an MRU scheme? In that scheme bursting up is pretty cheap >>>> since grant map is relative inexpensive, and recovering from the burst >>>> shouldn't be too expensive if you batch it. 
If it turns out to be not a >>>> burst but a sustained level of I/O then the MRU scheme would mean you >>>> wouldn't be recovering them. >>>> >>>> I also think there probably needs to be some tunable per device limit on >>>> the maximum persistent grants, perhaps minimum and maximum pool sizes >>>> ties in with an MRU scheme? If nothing else it gives the admin the >>>> ability to prioritise devices. >>> If we introduce a reclaim call we have to be careful not to fall back >>> to a map/unmap scheme like we had before. >>> >>> The way I see it either these additional grants are useful or not. >>> In the first case we could just limit the maximum amount of persistent >>> grants and be done with it. >>> If they are not useful (they have been allocated for one very large >>> request and not used much after that), could we find a way to identify >>> unusually large requests and avoid using persistent grants for those? >> Isn't it possible that these grants are useful for some periods of >> time, but not for others? You wouldn't say, "Caching the disk data in >> main memory is either useful or not; if it is not useful (if it was >> allocated for one very large request and not used much after that), we >> should find a way to identify unusually large requests and avoid >> caching it." If you're playing a movie, sure; but in most cases, the >> cache was useful for a time, then stopped being useful. Treating the >> persistent grants the same way makes sense to me. > Right, this is what I was trying to suggest with the MRU scheme. If you > are using lots of grants and you keep on reusing them then they remain > persistent and don't get reclaimed. If you are not reusing them for a > while then they get reclaimed. If you make "for a while" big enough then > you should find you aren't unintentionally falling back to a map/unmap > scheme.And I was trying to say that I agreed with you. :-) BTW, I presume "MRU" stands for "Most Recently Used", and means "Keep the most recently used"; is there a practical difference between that and "LRU" ("Discard the Least Recently Used")? Presumably we could implement the clock algorithm pretty reasonably... -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On Thu, 2013-06-27 at 14:58 +0100, George Dunlap wrote:> On 26/06/13 12:37, Ian Campbell wrote: > > On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote: > >> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini > >> <stefano.stabellini@eu.citrix.com> wrote: > >>> On Tue, 25 Jun 2013, Ian Campbell wrote: > >>>> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: > >>>>> On 21/06/13 20:07, Matt Wilson wrote: > >>>>>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > >>>>>>> Hello, > >>>>>>> > >>>>>>> While working on further block improvements I've found an issue with > >>>>>>> persistent grants in blkfront. > >>>>>>> > >>>>>>> Persistent grants basically allocate grants and then they are never > >>>>>>> released, so both blkfront and blkback keep using the same memory pages > >>>>>>> for all the transactions. > >>>>>>> > >>>>>>> This is not a problem in blkback, because we can dynamically choose how > >>>>>>> many grants we want to map. On the other hand, blkfront cannot remove > >>>>>>> the access to those grants at any point, because blkfront doesn't know > >>>>>>> if blkback has this grants mapped persistently or not. > >>>>>>> > >>>>>>> So if for example we start expanding the number of segments in indirect > >>>>>>> requests, to a value like 512 segments per requests, blkfront will > >>>>>>> probably try to persistently map 512*32+512 = 16896 grants per device, > >>>>>>> that's much more grants that the current default, which is 32*256 = 8192 > >>>>>>> (if using grant tables v2). This can cause serious problems to other > >>>>>>> interfaces inside the DomU, since blkfront basically starts hoarding all > >>>>>>> possible grants, leaving other interfaces completely locked. > >>>>>> Yikes. > >>>>>> > >>>>>>> I've been thinking about different ways to solve this, but so far I > >>>>>>> haven't been able to found a nice solution: > >>>>>>> > >>>>>>> 1. Limit the number of persistent grants a blkfront instance can use, > >>>>>>> let's say that only the first X used grants will be persistently mapped > >>>>>>> by both blkfront and blkback, and if more grants are needed the previous > >>>>>>> map/unmap will be used. > >>>>>> I'm not thrilled with this option. It would likely introduce some > >>>>>> significant performance variability, wouldn't it? > >>>>> Probably, and also it will be hard to distribute the number of available > >>>>> grant across the different interfaces in a performance sensible way, > >>>>> specially given the fact that once a grant is assigned to a interface it > >>>>> cannot be returned back to the pool of grants. > >>>>> > >>>>> So if we had two interfaces with very different usage (one very busy and > >>>>> another one almost idle), and equally distribute the grants amongst > >>>>> them, one will have a lot of unused grants while the other will suffer > >>>>> from starvation. > >>>> I do think we need to implement some sort of reclaim scheme, which > >>>> probably does mean a specific request (per your #4). We simply can't > >>>> have a device which once upon a time had high throughput but is no > >>>> mostly ideal continue to tie up all those grants. > >>>> > >>>> If you make the reuse of grants use an MRU scheme and reclaim the > >>>> currently unused tail fairly infrequently and in large batches then the > >>>> perf overhead should be minimal, I think. 
> >>>> > >>>> I also don't think I would discount the idea of using ephemeral grants > >>>> to cover bursts so easily either, in fact it might fall out quite > >>>> naturally from an MRU scheme? In that scheme bursting up is pretty cheap > >>>> since grant map is relative inexpensive, and recovering from the burst > >>>> shouldn't be too expensive if you batch it. If it turns out to be not a > >>>> burst but a sustained level of I/O then the MRU scheme would mean you > >>>> wouldn't be recovering them. > >>>> > >>>> I also think there probably needs to be some tunable per device limit on > >>>> the maximum persistent grants, perhaps minimum and maximum pool sizes > >>>> ties in with an MRU scheme? If nothing else it gives the admin the > >>>> ability to prioritise devices. > >>> If we introduce a reclaim call we have to be careful not to fall back > >>> to a map/unmap scheme like we had before. > >>> > >>> The way I see it either these additional grants are useful or not. > >>> In the first case we could just limit the maximum amount of persistent > >>> grants and be done with it. > >>> If they are not useful (they have been allocated for one very large > >>> request and not used much after that), could we find a way to identify > >>> unusually large requests and avoid using persistent grants for those? > >> Isn't it possible that these grants are useful for some periods of > >> time, but not for others? You wouldn't say, "Caching the disk data in > >> main memory is either useful or not; if it is not useful (if it was > >> allocated for one very large request and not used much after that), we > >> should find a way to identify unusually large requests and avoid > >> caching it." If you're playing a movie, sure; but in most cases, the > >> cache was useful for a time, then stopped being useful. Treating the > >> persistent grants the same way makes sense to me. > > Right, this is what I was trying to suggest with the MRU scheme. If you > > are using lots of grants and you keep on reusing them then they remain > > persistent and don't get reclaimed. If you are not reusing them for a > > while then they get reclaimed. If you make "for a while" big enough then > > you should find you aren't unintentionally falling back to a map/unmap > > scheme. > > And I was trying to say that I agreed with you. :-)Excellent ;-)> BTW, I presume "MRU" stands for "Most Recently Used", and means "Keep > the most recently used"; is there a practical difference between that > and "LRU" ("Discard the Least Recently Used")?I started off with LRU and then got my self confused and changed it everywhere. Yes I mean keep Most Recently Used == discard Least Recently Used.> Presumably we could implement the clock algorithm pretty reasonably...That's the sort of approach I was imagining... Ian. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
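A minimal sketch of what a clock (second-chance) pass over the persistent grants could look like, purely illustrative and assuming each grant carries a referenced flag that the I/O path sets on use (a variant of the earlier hypothetical struct):

    #include <linux/list.h>
    #include <linux/types.h>
    #include <xen/grant_table.h>

    /* Hypothetical clock sweep: grants referenced since the last pass get
     * a second chance (flag cleared); untouched ones become candidates. */
    struct persistent_gnt {
        grant_ref_t      gref;
        bool             referenced;      /* set by the I/O path on each use */
        struct list_head node;
    };

    static unsigned int clock_sweep(struct list_head *grants,
                                    struct list_head *reclaim,
                                    unsigned int max)
    {
        struct persistent_gnt *gnt, *tmp;
        unsigned int found = 0;

        list_for_each_entry_safe(gnt, tmp, grants, node) {
            if (found == max)
                break;
            if (gnt->referenced) {
                gnt->referenced = false;  /* second chance */
                continue;
            }
            list_move_tail(&gnt->node, reclaim);
            found++;
        }
        return found;                     /* grants queued for reclaim */
    }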
On 21/06/13 20:07, Matt Wilson wrote:> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >> Hello, >> >> While working on further block improvements I''ve found an issue with >> persistent grants in blkfront. >> >> Persistent grants basically allocate grants and then they are never >> released, so both blkfront and blkback keep using the same memory pages >> for all the transactions. >> >> This is not a problem in blkback, because we can dynamically choose how >> many grants we want to map. On the other hand, blkfront cannot remove >> the access to those grants at any point, because blkfront doesn''t know >> if blkback has this grants mapped persistently or not. >> >> So if for example we start expanding the number of segments in indirect >> requests, to a value like 512 segments per requests, blkfront will >> probably try to persistently map 512*32+512 = 16896 grants per device, >> that''s much more grants that the current default, which is 32*256 = 8192 >> (if using grant tables v2). This can cause serious problems to other >> interfaces inside the DomU, since blkfront basically starts hoarding all >> possible grants, leaving other interfaces completely locked. > > Yikes. > >> I''ve been thinking about different ways to solve this, but so far I >> haven''t been able to found a nice solution: >> >> 1. Limit the number of persistent grants a blkfront instance can use, >> let''s say that only the first X used grants will be persistently mapped >> by both blkfront and blkback, and if more grants are needed the previous >> map/unmap will be used. > > I''m not thrilled with this option. It would likely introduce some > significant performance variability, wouldn''t it? > >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I >> have not benchmarked this solution, but I''m quite sure it will involve a >> performance regression, specially when scaling to a high number of domains). > > Why do you think so?I''ve hacked a prototype blkback using grant_copy instead of persistent grants, and removed the persistent grants support in blkfront and indeed the performance of grant_copy is lower than persistent grants, and it seems to scale much worse. I''ve run several fio read/write benchmarks, using 512 segments per request on a ramdisk, and the output is the following: http://xenbits.xen.org/people/royger/grant_copy/ Roger.
On 27/06/13 16:21, Ian Campbell wrote:
> On Thu, 2013-06-27 at 14:58 +0100, George Dunlap wrote: >> On 26/06/13 12:37, Ian Campbell wrote: >>> On Wed, 2013-06-26 at 10:37 +0100, George Dunlap wrote: >>>> On Tue, Jun 25, 2013 at 7:04 PM, Stefano Stabellini >>>> <stefano.stabellini@eu.citrix.com> wrote: >>>>> On Tue, 25 Jun 2013, Ian Campbell wrote: >>>>>> On Sat, 2013-06-22 at 09:11 +0200, Roger Pau Monné wrote: >>>>>>> On 21/06/13 20:07, Matt Wilson wrote: >>>>>>>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> While working on further block improvements I've found an issue with >>>>>>>>> persistent grants in blkfront. >>>>>>>>> >>>>>>>>> Persistent grants basically allocate grants and then they are never >>>>>>>>> released, so both blkfront and blkback keep using the same memory pages >>>>>>>>> for all the transactions. >>>>>>>>> >>>>>>>>> This is not a problem in blkback, because we can dynamically choose how >>>>>>>>> many grants we want to map. On the other hand, blkfront cannot remove >>>>>>>>> the access to those grants at any point, because blkfront doesn't know >>>>>>>>> if blkback has this grants mapped persistently or not. >>>>>>>>> >>>>>>>>> So if for example we start expanding the number of segments in indirect >>>>>>>>> requests, to a value like 512 segments per requests, blkfront will >>>>>>>>> probably try to persistently map 512*32+512 = 16896 grants per device, >>>>>>>>> that's much more grants that the current default, which is 32*256 = 8192 >>>>>>>>> (if using grant tables v2). This can cause serious problems to other >>>>>>>>> interfaces inside the DomU, since blkfront basically starts hoarding all >>>>>>>>> possible grants, leaving other interfaces completely locked. >>>>>>>> Yikes. >>>>>>>> >>>>>>>>> I've been thinking about different ways to solve this, but so far I >>>>>>>>> haven't been able to found a nice solution: >>>>>>>>> >>>>>>>>> 1. Limit the number of persistent grants a blkfront instance can use, >>>>>>>>> let's say that only the first X used grants will be persistently mapped >>>>>>>>> by both blkfront and blkback, and if more grants are needed the previous >>>>>>>>> map/unmap will be used. >>>>>>>> I'm not thrilled with this option. It would likely introduce some >>>>>>>> significant performance variability, wouldn't it? >>>>>>> Probably, and also it will be hard to distribute the number of available >>>>>>> grant across the different interfaces in a performance sensible way, >>>>>>> specially given the fact that once a grant is assigned to a interface it >>>>>>> cannot be returned back to the pool of grants. >>>>>>> >>>>>>> So if we had two interfaces with very different usage (one very busy and >>>>>>> another one almost idle), and equally distribute the grants amongst >>>>>>> them, one will have a lot of unused grants while the other will suffer >>>>>>> from starvation. >>>>>> I do think we need to implement some sort of reclaim scheme, which >>>>>> probably does mean a specific request (per your #4). We simply can't >>>>>> have a device which once upon a time had high throughput but is now >>>>>> mostly idle continue to tie up all those grants. >>>>>> >>>>>> If you make the reuse of grants use an MRU scheme and reclaim the >>>>>> currently unused tail fairly infrequently and in large batches then the >>>>>> perf overhead should be minimal, I think.
>>>>>> >>>>>> I also don't think I would discount the idea of using ephemeral grants >>>>>> to cover bursts so easily either, in fact it might fall out quite >>>>>> naturally from an MRU scheme? In that scheme bursting up is pretty cheap >>>>>> since grant map is relative inexpensive, and recovering from the burst >>>>>> shouldn't be too expensive if you batch it. If it turns out to be not a >>>>>> burst but a sustained level of I/O then the MRU scheme would mean you >>>>>> wouldn't be recovering them. >>>>>> >>>>>> I also think there probably needs to be some tunable per device limit on >>>>>> the maximum persistent grants, perhaps minimum and maximum pool sizes >>>>>> ties in with an MRU scheme? If nothing else it gives the admin the >>>>>> ability to prioritise devices. >>>>> If we introduce a reclaim call we have to be careful not to fall back >>>>> to a map/unmap scheme like we had before. >>>>> >>>>> The way I see it either these additional grants are useful or not. >>>>> In the first case we could just limit the maximum amount of persistent >>>>> grants and be done with it. >>>>> If they are not useful (they have been allocated for one very large >>>>> request and not used much after that), could we find a way to identify >>>>> unusually large requests and avoid using persistent grants for those? >>>> Isn't it possible that these grants are useful for some periods of >>>> time, but not for others? You wouldn't say, "Caching the disk data in >>>> main memory is either useful or not; if it is not useful (if it was >>>> allocated for one very large request and not used much after that), we >>>> should find a way to identify unusually large requests and avoid >>>> caching it." If you're playing a movie, sure; but in most cases, the >>>> cache was useful for a time, then stopped being useful. Treating the >>>> persistent grants the same way makes sense to me. >>> Right, this is what I was trying to suggest with the MRU scheme. If you >>> are using lots of grants and you keep on reusing them then they remain >>> persistent and don't get reclaimed. If you are not reusing them for a >>> while then they get reclaimed. If you make "for a while" big enough then >>> you should find you aren't unintentionally falling back to a map/unmap >>> scheme. >> >> And I was trying to say that I agreed with you. :-) > > Excellent ;-)

I also agree that this is the best solution; I will start looking at implementing it.

>> BTW, I presume "MRU" stands for "Most Recently Used", and means "Keep >> the most recently used"; is there a practical difference between that >> and "LRU" ("Discard the Least Recently Used")? > > I started off with LRU and then got myself confused and changed it > everywhere. Yes I mean keep Most Recently Used == discard Least Recently > Used.

This will help if the disk is only doing intermittent bursts of data, but if the disk is under high I/O for a long time we might end up in the same situation (all grants hoarded by a single disk). We should make sure that there's always a buffer of unused grants so other disks or NIC interfaces can continue to work as expected.
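As a strawman for that kind of bound, something like the following would keep a reserve of free grant references however busy a single disk gets. The numbers and the even split across devices are arbitrary illustration, not a proposed policy.

/*
 * Sketch only: cap the persistently-granted pages each device may hold
 * so a domain-wide reserve is always left free for other frontends
 * (other disks, netfront, ...).
 */
#include <linux/kernel.h>

#define GNT_TOTAL	8192	/* e.g. 32 frames x 256 entries (grant table v2) */
#define GNT_RESERVE	1024	/* always leave this many grants unused          */

static unsigned int max_persistent_grants(unsigned int nr_devices)
{
	unsigned int usable = GNT_TOTAL - GNT_RESERVE;

	/*
	 * Split the non-reserved grants evenly across active frontends;
	 * an LRU/clock reclaim pass keeps each device under its share
	 * even after a sustained period of heavy I/O.
	 */
	return usable / max(nr_devices, 1u);
}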
On Thu, 27 Jun 2013, Roger Pau Monné wrote:
> On 21/06/13 20:07, Matt Wilson wrote: > > On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: > >> Hello, > >> > >> While working on further block improvements I've found an issue with > >> persistent grants in blkfront. > >> > >> Persistent grants basically allocate grants and then they are never > >> released, so both blkfront and blkback keep using the same memory pages > >> for all the transactions. > >> > >> This is not a problem in blkback, because we can dynamically choose how > >> many grants we want to map. On the other hand, blkfront cannot remove > >> the access to those grants at any point, because blkfront doesn't know > >> if blkback has this grants mapped persistently or not. > >> > >> So if for example we start expanding the number of segments in indirect > >> requests, to a value like 512 segments per requests, blkfront will > >> probably try to persistently map 512*32+512 = 16896 grants per device, > >> that's much more grants that the current default, which is 32*256 = 8192 > >> (if using grant tables v2). This can cause serious problems to other > >> interfaces inside the DomU, since blkfront basically starts hoarding all > >> possible grants, leaving other interfaces completely locked. > > > > Yikes. > > > >> I've been thinking about different ways to solve this, but so far I > >> haven't been able to found a nice solution: > >> > >> 1. Limit the number of persistent grants a blkfront instance can use, > >> let's say that only the first X used grants will be persistently mapped > >> by both blkfront and blkback, and if more grants are needed the previous > >> map/unmap will be used. > > > > I'm not thrilled with this option. It would likely introduce some > > significant performance variability, wouldn't it? > > > >> 2. Switch to grant copy in blkback, and get rid of persistent grants (I > >> have not benchmarked this solution, but I'm quite sure it will involve a > >> performance regression, specially when scaling to a high number of domains). > > > > Why do you think so? > > I've hacked a prototype blkback using grant_copy instead of persistent > grants, and removed the persistent grants support in blkfront. Indeed, > the performance of grant_copy is lower than persistent grants, and it > seems to scale much worse. I've run several fio read/write benchmarks, > using 512 segments per request on a ramdisk, and the output is the > following: > > http://xenbits.xen.org/people/royger/grant_copy/

Very impressive. We should consider doing the same experiment with netfront/netback at some point.
On 24/06/13 13:06, Stefano Stabellini wrote:
> On Sat, 22 Jun 2013, Wei Liu wrote: >> On Fri, Jun 21, 2013 at 04:16:25PM -0400, Konrad Rzeszutek Wilk wrote: >>> On Fri, Jun 21, 2013 at 07:10:59PM +0200, Roger Pau Monné wrote: >>>> Hello, >>>> >>>> While working on further block improvements I've found an issue with >>>> persistent grants in blkfront. >>>> >>>> Persistent grants basically allocate grants and then they are never >>>> released, so both blkfront and blkback keep using the same memory pages >>>> for all the transactions. >>>> >>>> This is not a problem in blkback, because we can dynamically choose how >>>> many grants we want to map. On the other hand, blkfront cannot remove >>>> the access to those grants at any point, because blkfront doesn't know >>>> if blkback has this grants mapped persistently or not. >>>> >>>> So if for example we start expanding the number of segments in indirect >>>> requests, to a value like 512 segments per requests, blkfront will >>>> probably try to persistently map 512*32+512 = 16896 grants per device, >>>> that's much more grants that the current default, which is 32*256 = 8192 >>>> (if using grant tables v2). This can cause serious problems to other >>>> interfaces inside the DomU, since blkfront basically starts hoarding all >>>> possible grants, leaving other interfaces completely locked. >>>> >>>> I've been thinking about different ways to solve this, but so far I >>>> haven't been able to found a nice solution: >>>> >>>> 1. Limit the number of persistent grants a blkfront instance can use, >>>> let's say that only the first X used grants will be persistently mapped >>>> by both blkfront and blkback, and if more grants are needed the previous >>>> map/unmap will be used. >>>> >>>> 2. Switch to grant copy in blkback, and get rid of persistent grants (I >>>> have not benchmarked this solution, but I'm quite sure it will involve a >>>> performance regression, specially when scaling to a high number of domains). >>>> >> >> Any chance that the speed of copying is fast enough for block devices? >> >>>> 3. Increase the size of the grant_table or the size of a single grant >>>> (from 4k to 2M) (this is from Stefano Stabellini). >>>> >>>> 4. Introduce a new request type that we can use to request blkback to >>>> unmap certain grefs so we can free them in blkfront. >>> >>> >>> 5). Lift the limit of grant pages a domain can have. >> >> If I'm not mistaken, this is basically the same as "increase the size of >> the grant_table" in #3. > > Yes, that was one of the things I was suggesting, but it needs > investigating: I wouldn't want that increasing the number of grant > frames would reach a different scalability limit of the data structure.

I don't think there's any implicit scalability limit in the data structure itself; it's just an array, and grants are ordered as array[gref]. I've discussed with Stefano the usage of domain pages to increase the size of the grant table, so instead of using xenheap pages we could use domain pages and thus remove the limitation (since we would be consuming domain memory). I have a very hacky prototype that uses domain pages instead of xenheap pages for expanding the grant table, but I think that before implementing this it would be more suitable to implement #4. Even if we are using domain pages to increase the grant table, we still need a way to allow blkfront to remove persistent grants, or we will end up with a lot of unused pages in blkfront after I/O bursts.
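A strawman of what such an operation (#4) might look like on the blkif ring follows. The op code, layout and limits are invented for this sketch; nothing like it exists in the blkif protocol as of this thread.

/*
 * Hypothetical "unmap persistent grants" request from blkfront to
 * blkback.  Values and layout chosen arbitrarily for illustration.
 */
#include <stdint.h>

typedef uint32_t grant_ref_t;

#define BLKIF_OP_UNMAP_PGRANT	7	/* op code value assumed, not allocated     */
#define BLKIF_MAX_UNMAP_GREFS	16	/* kept small so the request still fits in
					   a fixed-size ring slot                   */

struct blkif_request_unmap {
	uint8_t		operation;	/* BLKIF_OP_UNMAP_PGRANT                    */
	uint8_t		nr_grefs;	/* valid entries in gref[]                  */
	uint16_t	_pad1;
	uint32_t	_pad2;		/* explicit padding keeps the 32-bit and
					   64-bit layouts identical                 */
	uint64_t	id;		/* echoed back in the response              */
	grant_ref_t	gref[BLKIF_MAX_UNMAP_GREFS];
};

/*
 * Intended flow: blkfront's reclaim pass queues one of these for the
 * grants it wants to retire, and only ends foreign access on a gref
 * once blkback's response confirms the backend mapping is gone.
 */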