Robert Phillips
2006-Jul-01 18:55 UTC
[Xen-devel] Page fault is 4 times faster with XI shadow mechanism
Hello Han, I am pleased you approve of the design and implementation of the XI shadow mechanism, and I appreciate the time and care you've taken in reviewing this substantial body of new code.

You asked about performance statistics. With the current XI patch, we are seeing the following:

- page fault times for XI are about 4 times faster than non-XI: 10.56 (non-XI) vs 2.43 (XI) usec
- sync-all times for XI are about 18% faster: 39.72 (non-XI) vs 33.51 (XI) usec
- invalidate-page times for XI are about 5 times faster: 22.75 (non-XI) vs 4.00 (XI) usec
- we haven't measured gva-to-gpa, but I would expect it to be about the same; it's quite simple.

You can easily gather your own statistics. The XI patch gathers statistics and prints them when you type 'y' from the XEN console. ('Y' clears the statistics.) Statistics gathering occurs even when the XI code is disabled in xen/Config.mk; in that case it gives you statistics for the non-XI shadow code.

In an earlier email you provided a code fix, namely "if (c_curr_rw && !_32pae_l3)". Good catch! I will incorporate your fix in our code base. As you suggested, should a guest L3 PTE erroneously have its R/W flag set, the XI shadow code would propagate the error and set the R/W flag in the shadow L3 PTE. Perhaps the XI code could do a better job of validating guest page table entries, but I was reluctant to be more rigorous about checking guest PTEs than real hardware is.

In your latest email, you ask "Do we really need to reserve one snapshot page for each smfn at first and retain it until the HVM domain is destroyed?" Well, I don't. I simply pre-allocate a pool of SPTIs. It can be quite a large pool, but certainly not one SPTI per MFN. SPTIs are allocated on demand (when a guest page needs to be shadowed) and, when the pool runs low, the LRU SPTs are torn down and their SPTIs recycled.
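[Editorial aside: the pool scheme described above — SPTIs pre-allocated up front, handed out on demand, with the least-recently-used shadows recycled when the pool runs low — can be sketched roughly as below. This is an illustrative sketch, not the actual XI patch code; the type and function names (`spti_t`, `spti_alloc`, `POOL_LOW_WATER`, etc.) are invented for the example, and the LRU list is maintained by the caller rather than shown here.]

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define POOL_LOW_WATER 16

/* Hypothetical shadow-page-table info: one shadow page plus its snapshot. */
typedef struct spti {
    struct spti *next;                 /* free-list / LRU-list linkage */
    unsigned char shadow[PAGE_SIZE];   /* the shadow page table page */
    unsigned char snapshot[PAGE_SIZE]; /* snapshot of the guest page table */
    unsigned long gmfn;                /* guest frame this SPTI shadows */
} spti_t;

static spti_t *free_list;  /* pre-allocated, ready-to-go SPTIs */
static spti_t *lru_head;   /* in-use SPTIs, least recently used first */
static size_t free_count;

/* Pre-allocate the pool once (~5% of system memory in the real design). */
void spti_pool_init(size_t n)
{
    for (size_t i = 0; i < n; i++) {
        spti_t *s = malloc(sizeof(*s));
        s->next = free_list;
        free_list = s;
        free_count++;
    }
}

/* Tear down the LRU shadow and put its SPTI back on the free list. */
static void evict_lru(void)
{
    if (!lru_head)
        return;
    spti_t *victim = lru_head;
    lru_head = victim->next;
    victim->next = free_list;
    free_list = victim;
    free_count++;
}

/* Critical path: pull an SPTI off the list and zero its pages.
 * No memory allocation happens here.  (In real code the caller would
 * link the returned SPTI into the LRU list once the shadow goes live.) */
spti_t *spti_alloc(unsigned long gmfn)
{
    if (free_count <= POOL_LOW_WATER)
        evict_lru();               /* pool is low: recycle the LRU SPT */
    spti_t *s = free_list;
    if (!s)
        return NULL;
    free_list = s->next;
    free_count--;
    memset(s->shadow, 0, PAGE_SIZE);
    memset(s->snapshot, 0, PAGE_SIZE);
    s->gmfn = gmfn;
    return s;
}
```

The point of the shape is that `spti_alloc` does only list manipulation and zeroing; the zeroing is the part Robert calls "irksome" and proposes to move into the idle loop.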
Currently I allocate about 5% of system memory for this purpose (this includes the SPT, its snapshot and the backlink pages) and, with that reasonable investment, we get very good performance. With more study, I'm sure things could be tuned even better. (I hope I have properly understood your questions.)

-- rsp

On 7/1/06, zhu <vanbas.han@gmail.com> wrote:
> Hi,
> After taking some time to dig into your patch about the XI shadow page
> table, I have to say it's really a good design and implementation IMHO,
> especially the parts about the clear hierarchy for each smfn, the decision
> table, and how to support 32nopae in a rather elegant way. However, I
> have several questions to discuss with you. :-)
> 1) It seems the XI shadow pgt reserves all of the possible resources at the
> early stage for an HVM domain (the first time the asi is created). It could
> be quite proper to reserve the smfns and sptis. However, do we really
> need to reserve one snapshot page for each smfn at first and retain it
> until the HVM domain is destroyed? I guess a large number of gpts will
> not be modified frequently after they are fully set up. IMHO, it
> would be better to manage these snapshot pages dynamically. Of course, this
> will change the basic logic of the code; e.g. you have to sync the
> shadow pgt when invoking spti_make_shadow instead of leaving it out of
> sync, and you can't set up the entire low-level shadow pgt when invoking
> resync_spte, since that could cost a lot of time.
> 2) The GP back link plays a very important role in the XI shadow pgt. However,
> it will also cause high memory pressure for the domain (2 pages for each
> smfn). For normal guest pages, as opposed to GPT pages, I guess its
> usage is limited: only when invoking xi_invld_mfn, divide_large_page or
> dirty logging will we refer to the back link for these normal guest
> pages. Is it reasonable to implement the back link only for the GPT
> pages? Of course, this will increase the complexity of the code a little.
> 3) Can you show us statistics comparing the current shadow pgt and the XI
> pgt for some critical operations, such as shadow_resync_all, gva_to_gpa,
> shadow_fault and so on? I'm really curious about it.
>
> I have to say I'm not very familiar with the current shadow pgt
> implementation, so I could be missing some important considerations when I
> post these questions. Please point them out.
> Thanks for sharing your idea and code with us. :-)
>
> _______________________________________________________
> Best Regards,
> hanzhu

--
--------------------------------------------------------------------
Robert S. Phillips                    Virtual Iron Software
rphillips@virtualiron.com             Tower 1, Floor 2
978-849-1220                          900 Chelmsford Street
                                      Lowell, MA 01851

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
zhu
2006-Jul-02 04:20 UTC
[Xen-devel] Re: Page fault is 4 times faster with XI shadow mechanism
Robert Phillips wrote:
> Well, I don't. I simply pre-allocate a pool of SPTIs. It can be quite a
> large pool, but certainly not one SPTI per MFN. SPTIs are allocated on
> demand (when a guest page needs to be shadowed) and, when the pool runs
> low, the LRU SPTs are torn down and their SPTIs recycled.

Well, what I mean is that we should not connect a snapshot page with an
SPTI at the time the SPTIs are first reserved. It would be better to
manage these snapshot pages in a separate, dynamic pool.
BTW: What do you think of the backlink issue mentioned in my previous
mail?

> Currently I allocate about 5% of system memory for this purpose (this
> includes the SPT, its snapshot and the backlink pages) and, with that
> reasonable investment, we get very good performance. With more study, I'm
> sure things could be tuned even better. (I hope I have properly understood
> your questions.)
>
> -- rsp
>
> [...]
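[Editorial aside: zhu's suggestion — decouple snapshot pages from SPTIs and hand them out from a separate dynamic pool only when a shadowed GPT actually needs one — might look roughly like the sketch below. This is not code from the patch or the thread; all names (`spti_t`, `spti_attach_snapshot`, the pool layout) are invented for illustration, and as zhu notes, the snapshot would then have to be synced at attach time rather than left out of sync.]

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Hypothetical SPTI that holds only a *pointer* to a snapshot page,
 * instead of embedding one snapshot page per SPTI up front. */
typedef struct spti {
    unsigned char *snapshot;   /* NULL until a snapshot is needed */
    unsigned long gmfn;
} spti_t;

/* Separate dynamic pool of snapshot pages, managed independently. */
static unsigned char **snap_pool;
static size_t snap_avail;

void snap_pool_init(size_t n)
{
    snap_pool = malloc(n * sizeof(*snap_pool));
    for (size_t i = 0; i < n; i++)
        snap_pool[i] = malloc(PAGE_SIZE);
    snap_avail = n;
}

/* Attach a snapshot lazily, the first time this shadow needs one.
 * The guest page table must be copied (synced) at attach time. */
int spti_attach_snapshot(spti_t *s, const unsigned char *gpt_page)
{
    if (s->snapshot)
        return 0;              /* already have one */
    if (snap_avail == 0)
        return -1;             /* pool exhausted; caller must reclaim */
    s->snapshot = snap_pool[--snap_avail];
    memcpy(s->snapshot, gpt_page, PAGE_SIZE);
    return 0;
}

/* Return the snapshot page to the pool when the shadow is torn down. */
void spti_release_snapshot(spti_t *s)
{
    if (!s->snapshot)
        return;
    snap_pool[snap_avail++] = s->snapshot;
    s->snapshot = NULL;
}
```

The trade-off is exactly the one debated in the thread: fewer pages tied up per SPTI, at the cost of a possible allocation failure and a page copy on the shadow-creation path.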
Robert Phillips
2006-Jul-02 14:53 UTC
[Xen-devel] Re: Page fault is 4 times faster with XI shadow mechanism
Okay, now I understand your question! It's a good question, too.

In XI, the idea is to have a pool of SPTIs all ready to go. When a page needs to be shadowed, I simply pull an SPTI off the list, zero its pages, and it is ready for use. No further memory allocation is needed. This is the critical path and I want it as short as possible. (The zeroing is irksome. I plan to hook the idle loop and try to keep SPTIs pre-zeroed.)

The shadow pages and backlink pages are pages that the XI code has reserved and taken for the domain. The domain owns them for its lifetime. It needs them only for use as shadow pages and backlink pages; it has no other dynamic need for pages.

Since the pages are owned by the domain, they are not available for use by some other domain. However, it would be possible to leave them in the domheap rather than having the domain grab them all up front. Then they might be available for use by other domains, and all domains could share a pool of shadow pages. But that approach would only be helpful if domains could overcommit memory. I think the idea of domains overcommitting memory and sharing pages is perilous, since each domain's behavior then depends on the good behavior of other domains, and there is no mechanism for domains to apply backpressure to each other to reduce their memory use.

Regarding the backlink pages: as you note, the common use of backlinks is to mark guest page tables as readonly. (The less common uses are to mark all guest pages as readonly (for logging dirty pages during live migration), to find large pages and divide them when they contain a guest page table, and to invalidate PTEs when a pfn-to-mfn mapping changes.)

One reason I have backlinks on all guest pages is that one can't know ahead of time which guest pages are (or will become) GPTs. When the code first detects a guest page being used as a guest page table, it would have to do a linear search to find all SPTEs that point to the new guest page table, so it can mark them as readonly.

The backlink mechanism is particularly clean and simple precisely because there is a backlink per SPTE, regardless of whether the SPTE points to a GP, GPT, SPT or nothing. This lets the backlinks be organized as an array with 512 elements, one to one with the SPTEs. Given a backlink it's trivial to find the corresponding SPTE, and vice versa. If one wanted to have backlinks for only the SPTEs that point to GPTs, things would really get complex: the backlinks themselves would have to be organized and dynamically allocated and freed, and that in the critical page fault path.

One could do without backlinks altogether if one were willing to put up with linear searching. It's a space/performance tradeoff. I think, with machines now having many megabytes of memory, users are more concerned about performance than a small memory overhead.

-- rsp

On 7/2/06, zhu <vanbas.han@gmail.com> wrote:
> Robert Phillips wrote:
> > Well, I don't. I simply pre-allocate a pool of SPTIs. It can be quite a
> > large pool, but certainly not one SPTI per MFN. SPTIs are allocated on
> > demand (when a guest page needs to be shadowed) and, when the pool runs
> > low, the LRU SPTs are torn down and their SPTIs recycled.
>
> Well, what I mean is that we should not connect a snapshot page with an
> SPTI at the time the SPTIs are first reserved. It would be better to
> manage these snapshot pages in a separate, dynamic pool.
> BTW: What do you think of the backlink issue mentioned in my previous
> mail?
>
> [...]
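[Editorial aside: the backlink layout Robert describes above — one backlink per SPTE, organized as a 512-entry array in one-to-one correspondence with the shadow page table's 512 entries — makes the SPTE↔backlink mapping pure index arithmetic. A minimal sketch follows; the types and names are invented for the example, not taken from the XI patch.]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SPTES_PER_PAGE 512   /* 512 8-byte entries in a 4 KB page table */

/* Hypothetical backlink record for one SPTE. */
typedef struct backlink {
    uint64_t gmfn;           /* guest frame the corresponding SPTE maps */
} backlink_t;

/* A shadow page table with its parallel backlink array. */
typedef struct shadow_pt {
    uint64_t spte[SPTES_PER_PAGE];
    backlink_t back[SPTES_PER_PAGE];
} shadow_pt_t;

/* Given a pointer to an SPTE, find its backlink: pure index arithmetic,
 * no searching and no dynamic allocation. */
static inline backlink_t *spte_to_backlink(shadow_pt_t *spt, uint64_t *spte)
{
    size_t i = spte - spt->spte;
    assert(i < SPTES_PER_PAGE);
    return &spt->back[i];
}

/* And the reverse mapping, equally trivial. */
static inline uint64_t *backlink_to_spte(shadow_pt_t *spt, backlink_t *bl)
{
    size_t i = bl - spt->back;
    assert(i < SPTES_PER_PAGE);
    return &spt->spte[i];
}
```

This is the cleanliness argument in the message above: because every SPTE has a backlink slot whether it points to a GP, GPT, SPT or nothing, neither direction of the mapping ever allocates or searches, which keeps it safe to use on the page-fault path.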
Robert Phillips
2006-Jul-03 10:01 UTC
[Xen-devel] Re: Page fault is 4 times faster with XI shadow mechanism
Keir et al. have not given any feedback. Not a peep. To be generous, though, it is a large body of code to digest.

-- rsp

On 7/2/06, zhu <vanbas.han@gmail.com> wrote:
> Really thorough explanation. Now I understand all of your concerns about
> the design. All of us can tune the code once it is checked in to the
> unstable tree.
> BTW: How about feedback from the Cambridge guys?
>
> Robert Phillips wrote:
> > In XI, the idea is to have a pool of SPTIs all ready to go. When a page
> > needs to be shadowed, I simply pull an SPTI off the list, zero its pages,
> > and it is ready for use. No further memory allocation is needed. This is
> > the critical path and I want it as short as possible.
> That's quite reasonable. Another classic example of a space-for-time
> trade-off.
>
> > One reason I have backlinks on all guest pages is because one can't know
> > ahead of time which guest pages are (or will become) GPTs. When the code
> > first detects a guest page being used as a guest page table, it would have
> > to do a linear search to find all SPTEs that point to the new guest page
> > table, so it can mark them as readonly.
> When we shadow it for the first time, we could know it's a GPT and only
> then connect the backlinks with the SPTE. However, the disadvantage is,
> just as you have noted, that it would increase the complexity of the
> critical shadow fault path.
>
> > One could do without backlinks altogether if one were willing to put up
> > with linear searching. It's a space/performance tradeoff. I think, with
> > machines now having many megabytes of memory, users are more concerned
> > about performance than a small memory overhead.
> >
> > -- rsp
> >
> > [...]