Hi,

As promised, a revised version of the waitq-on-missing-pages patch.
I've cut it back a fair bit; I think the resulting patch is maybe a bit
less efficient, but easier to understand.

I've smoke-tested it a bit and haven't seen anything catch fire, but I
suspect the VMs aren't very happy. I'll debug more thoroughly when I'm
back in the office next week, but would really like some feedback from
the rest of you in the meantime.

Cheers,

Tim.

# HG changeset patch
# User Tim Deegan <tim@xen.org>
# Parent 01c1bcbc62224833304ede44187400f65e8a6b4c
[RFC] x86/mm: use wait queues for mem_paging

Use a wait queue to put a guest vcpu to sleep while the requested gfn is
in paging state. This adds missing p2m_mem_paging_populate() calls to
some callers of the new get_gfn* variants, which would crash now
because they get an invalid mfn. It also fixes guest crashes due to
unexpected returns from do_memory_op because copy_to/from_guest ran into
a paged gfn. Now those places will always get a valid mfn.

This is based on an earlier RFC patch by Olaf Hering, but heavily
simplified (removing a per-gfn queue of waiting vcpus in favour of
a scan of all vcpus on page-in).
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Tim Deegan <tim@xen.org>

diff -r 01c1bcbc6222 xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c	Thu Feb 16 08:48:23 2012 +0100
+++ b/xen/arch/x86/mm/p2m.c	Thu Feb 16 14:39:13 2012 +0000
@@ -160,13 +160,49 @@ mfn_t __get_gfn_type_access(struct p2m_d
     }
 
     /* For now only perform locking on hap domains */
-    if ( locked && (hap_enabled(p2m->domain)) )
+    locked = locked && hap_enabled(p2m->domain);
+
+#ifdef __x86_64__
+again:
+#endif
+    if ( locked )
         /* Grab the lock here, don't release until put_gfn */
         gfn_lock(p2m, gfn, 0);
 
     mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
 
 #ifdef __x86_64__
+    if ( p2m_is_paging(*t) && q != p2m_query
+         && p2m->domain == current->domain )
+    {
+        if ( locked )
+            gfn_unlock(p2m, gfn, 0);
+
+        /* Ping the pager */
+        if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
+            p2m_mem_paging_populate(p2m->domain, gfn);
+
+        /* Safety catch: it _should_ be safe to wait here but if it's not,
+         * crash the VM, not the host */
+        if ( in_atomic() )
+        {
+            WARN();
+            domain_crash(p2m->domain);
+        }
+        else
+        {
+            current->arch.mem_paging_gfn = gfn;
+            wait_event(current->arch.mem_paging_wq, ({
+                mfn = p2m->get_entry(p2m, gfn, t, a,
+                                     p2m_query, page_order);
+                (*t != p2m_ram_paging_in);
+            }));
+            goto again;
+        }
+    }
+#endif
+
+#ifdef __x86_64__
     if ( q == p2m_unshare && p2m_is_shared(*t) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
@@ -942,25 +978,25 @@ void p2m_mem_paging_drop_page(struct dom
  * This function needs to be called whenever gfn_to_mfn() returns any of the p2m
  * paging types because the gfn may not be backed by a mfn.
  *
- * The gfn can be in any of the paging states, but the pager needs only be
- * notified when the gfn is in the paging-out path (paging_out or paged). This
- * function may be called more than once from several vcpus. If the vcpu belongs
- * to the guest, the vcpu must be stopped and the pager notified that the vcpu
- * was stopped. The pager needs to handle several requests for the same gfn.
+ * The gfn can be in any of the paging states, but the pager needs only
+ * be notified when the gfn is in the paging-out path (paging_out or
+ * paged). This function may be called more than once from several
+ * vcpus. The pager needs to handle several requests for the same gfn.
  *
- * If the gfn is not in the paging-out path and the vcpu does not belong to the
- * guest, nothing needs to be done and the function assumes that a request was
- * already sent to the pager. In this case the caller has to try again until the
- * gfn is fully paged in again.
+ * If the gfn is not in the paging-out path nothing needs to be done and
+ * the function assumes that a request was already sent to the pager.
+ * In this case the caller has to try again until the gfn is fully paged
+ * in again.
  */
 void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
 {
     struct vcpu *v = current;
-    mem_event_request_t req;
+    mem_event_request_t req = {0};
     p2m_type_t p2mt;
     p2m_access_t a;
     mfn_t mfn;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int send_request = 0;
 
     /* We're paging. There should be a ring */
     int rc = mem_event_claim_slot(d, &d->mem_event->paging);
@@ -974,9 +1010,6 @@ void p2m_mem_paging_populate(struct doma
     else if ( rc < 0 )
         return;
 
-    memset(&req, 0, sizeof(req));
-    req.type = MEM_EVENT_TYPE_PAGING;
-
     /* Fix p2m mapping */
     gfn_lock(p2m, gfn, 0);
     mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL);
@@ -986,26 +1019,29 @@ void p2m_mem_paging_populate(struct doma
         /* Evict will fail now, tag this request for pager */
         if ( p2mt == p2m_ram_paging_out )
             req.flags |= MEM_EVENT_FLAG_EVICT_FAIL;
-
-        set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2m_ram_paging_in, a);
+        if ( p2mt == p2m_ram_paging_out && mfn_valid(mfn) && v->domain == d )
+            /* Short-cut back to paged-in state (but not for foreign
+             * mappings, or the pager couldn't map it to page it out) */
+            set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K,
+                          paging_mode_log_dirty(d)
+                          ? p2m_ram_logdirty : p2m_ram_rw, a);
+        else
+        {
+            set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2m_ram_paging_in, a);
+            send_request = 1;
+        }
     }
     gfn_unlock(p2m, gfn, 0);
 
-    /* Pause domain if request came from guest and gfn has paging type */
-    if ( p2m_is_paging(p2mt) && v->domain == d )
+    /* No need to inform pager if the gfn is not in the page-out path */
+    if ( !send_request )
     {
-        vcpu_pause_nosync(v);
-        req.flags |= MEM_EVENT_FLAG_VCPU_PAUSED;
-    }
-    /* No need to inform pager if the gfn is not in the page-out path */
-    else if ( p2mt != p2m_ram_paging_out && p2mt != p2m_ram_paged )
-    {
-        /* gfn is already on its way back and vcpu is not paused */
         mem_event_cancel_slot(d, &d->mem_event->paging);
         return;
     }
 
     /* Send request to pager */
+    req.type = MEM_EVENT_TYPE_PAGING;
     req.gfn = gfn;
     req.p2mt = p2mt;
     req.vcpu_id = v->vcpu_id;
@@ -1122,6 +1158,7 @@ void p2m_mem_paging_resume(struct domain
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     mem_event_response_t rsp;
+    struct vcpu *v;
     p2m_type_t p2mt;
     p2m_access_t a;
     mfn_t mfn;
@@ -1147,9 +1184,10 @@ void p2m_mem_paging_resume(struct domain
             }
             gfn_unlock(p2m, rsp.gfn, 0);
         }
-        /* Unpause domain */
-        if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED )
-            vcpu_unpause(d->vcpu[rsp.vcpu_id]);
+        /* Wake any vcpus that were waiting for this GFN */
+        for_each_vcpu ( d, v )
+            if ( v->arch.mem_paging_gfn == rsp.gfn )
+                wake_up_all(&v->arch.mem_paging_wq);
     }
 }

diff -r 01c1bcbc6222 xen/include/asm-x86/domain.h
--- a/xen/include/asm-x86/domain.h	Thu Feb 16 08:48:23 2012 +0100
+++ b/xen/include/asm-x86/domain.h	Thu Feb 16 14:39:13 2012 +0000
@@ -4,6 +4,7 @@
 #include <xen/config.h>
 #include <xen/mm.h>
 #include <xen/radix-tree.h>
+#include <xen/wait.h>
 #include <asm/hvm/vcpu.h>
 #include <asm/hvm/domain.h>
 #include <asm/e820.h>
@@ -491,6 +492,12 @@ struct arch_vcpu
 
     struct paging_vcpu paging;
 
+#ifdef CONFIG_X86_64
+    /* Mem-paging: this vcpu is waiting for a gfn to be paged in */
+    struct waitqueue_head mem_paging_wq;
+    unsigned long mem_paging_gfn;
+#endif
+
 #ifdef CONFIG_X86_32
     /* map_domain_page() mapping cache. */
     struct mapcache_vcpu mapcache;
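The commit message describes the key simplification: instead of keeping a per-gfn queue of sleeping vcpus, the page-in path scans every vcpu of the domain and wakes any whose recorded gfn matches. As a minimal standalone model of that design choice (the `toy_*` names, the fixed-size array, and the `woken` flag are stand-ins invented here, not Xen's types):

```c
#include <assert.h>
#include <stddef.h>

#define NR_TOY_VCPUS 4
#define TOY_INVALID_GFN (~0UL)

/* Toy stand-in for struct vcpu: just the fields this patch adds. */
struct toy_vcpu {
    unsigned long mem_paging_gfn;  /* gfn this vcpu went to sleep on */
    int woken;                     /* stands in for wake_up_all() firing */
};

/* Model of the loop added to p2m_mem_paging_resume(): on page-in of
 * rsp_gfn, scan all vcpus and wake every one waiting on that gfn. */
static void toy_paging_resume(struct toy_vcpu *vcpus, size_t n,
                              unsigned long rsp_gfn)
{
    for (size_t i = 0; i < n; i++)
        if (vcpus[i].mem_paging_gfn == rsp_gfn)
            vcpus[i].woken = 1;
}
```

The trade-off is an O(nr_vcpus) scan on every page-in instead of per-gfn queue bookkeeping; spurious wakeups are harmless because `wait_event()` re-checks its condition and goes back to sleep if the gfn is still in flight.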
Andres Lagar-Cavilla
2012-Feb-17 16:05 UTC
Re: [RFC] [PATCH] x86/mm: use wait queues for mem_paging
>
> On Feb 16, 2012, at 10:20 AM, Tim Deegan wrote:
>
>> Hi,
>>
>> As promised, a revised version of the waitq-on-missing-pages patch.
>> I've cut it back a fair bit; I think the resulting patch is maybe a bit
>> less efficient, but easier to understand.
>>
>> I've smoke-tested it a bit and haven't seen anything catch fire but I
>> suspect the VMs aren't very happy. I'll debug more thoroughly when I'm
>> back in the office next week, but would really like some feedback from
>> the rest of you in the meantime.

Well, thanks for taking a stab at it. Looks fairly well. My main
observation is that we do not want to unconditionally go on a wait
queue everywhere. For example, the grant table code pretty reliably
unwinds itself and correctly tells the grant mapper to retry, in the
presence of a paged page. The same could be said of emulation
(X86_EMUL_RETRY). And certainly the balloon code should not sleep
waiting for populate!

Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
wait queue code is invoked only there. And then we can call
get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
__hvm_copy.

I understand that it is cleaner not to have to make that distinction,
but I don't want to fix things that aren't broken :)

More comments inline

>>
>> Cheers,
>>
>> Tim.
>>
>> # HG changeset patch
>> # User Tim Deegan <tim@xen.org>
>> # Parent 01c1bcbc62224833304ede44187400f65e8a6b4c
>> [RFC] x86/mm: use wait queues for mem_paging
>>
>> Use a wait queue to put a guest vcpu to sleep while the requested gfn is
>> in paging state. This adds missing p2m_mem_paging_populate() calls to
>> some callers of the new get_gfn* variants, which would crash now
>> because they get an invalid mfn. It also fixes guest crashes due to
>> unexpected returns from do_memory_op because copy_to/from_guest ran into
>> a paged gfn. Now those places will always get a valid mfn.
>>
>> This is based on an earlier RFC patch by Olaf Hering, but heavily
>> simplified (removing a per-gfn queue of waiting vcpus in favour of
>> a scan of all vcpus on page-in).
>>
>> Signed-off-by: Olaf Hering <olaf@aepfle.de>
>> Signed-off-by: Tim Deegan <tim@xen.org>
>>
>> diff -r 01c1bcbc6222 xen/arch/x86/mm/p2m.c
>> --- a/xen/arch/x86/mm/p2m.c	Thu Feb 16 08:48:23 2012 +0100
>> +++ b/xen/arch/x86/mm/p2m.c	Thu Feb 16 14:39:13 2012 +0000
>> @@ -160,13 +160,49 @@ mfn_t __get_gfn_type_access(struct p2m_d
>>      }
>>
>>      /* For now only perform locking on hap domains */
>> -    if ( locked && (hap_enabled(p2m->domain)) )
>> +    locked = locked && hap_enabled(p2m->domain);
>> +
>> +#ifdef __x86_64__

When are we getting rid of 32 bit hypervisor? (half kidding. Only half)

>> +again:
>> +#endif
>> +    if ( locked )
>>          /* Grab the lock here, don't release until put_gfn */
>>          gfn_lock(p2m, gfn, 0);
>>
>>      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
>>
>>  #ifdef __x86_64__
>> +    if ( p2m_is_paging(*t) && q != p2m_query
>> +         && p2m->domain == current->domain )
>> +    {
>> +        if ( locked )
>> +            gfn_unlock(p2m, gfn, 0);
>> +
>> +        /* Ping the pager */
>> +        if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
>> +            p2m_mem_paging_populate(p2m->domain, gfn);

You want to make sure the mfn is not valid. For p2m_ram_paging_out we
still have the mfn in place.

>> +
>> +        /* Safety catch: it _should_ be safe to wait here but if it's not,
>> +         * crash the VM, not the host */
>> +        if ( in_atomic() )
>> +        {
>> +            WARN();
>> +            domain_crash(p2m->domain);
>> +        }
>> +        else
>> +        {
>> +            current->arch.mem_paging_gfn = gfn;
>> +            wait_event(current->arch.mem_paging_wq, ({
>> +                mfn = p2m->get_entry(p2m, gfn, t, a,
>> +                                     p2m_query, page_order);
>> +                (*t != p2m_ram_paging_in);

Well, first of all the p2m->get_entry will not happen in a locked
context, so you will exit the wait loop not holding the p2m lock and
crash later. So you want __get_gfn_type_access with q = p2m_query, and
then put_gfn if the condition fails. Probably not gonna fit in a ({})
block ;)

I think checking for (!p2m_is_paging(*t)) is more appropriate. Or even
(!p2m_is_paging(*t) && mfn_valid(mfn)). Although I'm mildly hesitant
about the latter because conceivably another vcpu may beat us to
grabbing the p2m lock after the gfn is populated, and do something to
its mfn.

>> +            }));
>> +            goto again;
>> +        }
>> +    }
>> +#endif
>> +
>> +#ifdef __x86_64__
>>      if ( q == p2m_unshare && p2m_is_shared(*t) )
>>      {
>>          ASSERT(!p2m_is_nestedp2m(p2m));
>> @@ -942,25 +978,25 @@ void p2m_mem_paging_drop_page(struct dom
>>   * This function needs to be called whenever gfn_to_mfn() returns any of the p2m
>>   * paging types because the gfn may not be backed by a mfn.
>>   *
>> - * The gfn can be in any of the paging states, but the pager needs only be
>> - * notified when the gfn is in the paging-out path (paging_out or paged). This
>> - * function may be called more than once from several vcpus. If the vcpu belongs
>> - * to the guest, the vcpu must be stopped and the pager notified that the vcpu
>> - * was stopped. The pager needs to handle several requests for the same gfn.
>> + * The gfn can be in any of the paging states, but the pager needs only
>> + * be notified when the gfn is in the paging-out path (paging_out or
>> + * paged). This function may be called more than once from several
>> + * vcpus. The pager needs to handle several requests for the same gfn.
>>   *
>> - * If the gfn is not in the paging-out path and the vcpu does not belong to the
>> - * guest, nothing needs to be done and the function assumes that a request was
>> - * already sent to the pager. In this case the caller has to try again until the
>> - * gfn is fully paged in again.
>> + * If the gfn is not in the paging-out path nothing needs to be done and
>> + * the function assumes that a request was already sent to the pager.
>> + * In this case the caller has to try again until the gfn is fully paged
>> + * in again.
>>   */
>>  void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
>>  {
>>      struct vcpu *v = current;
>> -    mem_event_request_t req;
>> +    mem_event_request_t req = {0};
>>      p2m_type_t p2mt;
>>      p2m_access_t a;
>>      mfn_t mfn;
>>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    int send_request = 0;
>>
>>      /* We're paging. There should be a ring */
>>      int rc = mem_event_claim_slot(d, &d->mem_event->paging);

Given all the optimizations around the send_request flag, it might be
worth moving the claim_slot call to an if(send_request) block at the
bottom, and get rid of cancel_slot altogether. Claim_slot could
potentially go (or cause someone else to go) to a wait queue, and it's
best to not be that rude.

>> @@ -974,9 +1010,6 @@ void p2m_mem_paging_populate(struct doma
>>      else if ( rc < 0 )
>>          return;
>>
>> -    memset(&req, 0, sizeof(req));
>> -    req.type = MEM_EVENT_TYPE_PAGING;
>> -
>>      /* Fix p2m mapping */
>>      gfn_lock(p2m, gfn, 0);
>>      mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL);
>> @@ -986,26 +1019,29 @@ void p2m_mem_paging_populate(struct doma
>>          /* Evict will fail now, tag this request for pager */
>>          if ( p2mt == p2m_ram_paging_out )
>>              req.flags |= MEM_EVENT_FLAG_EVICT_FAIL;
>> -
>> -        set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2m_ram_paging_in, a);
>> +        if ( p2mt == p2m_ram_paging_out && mfn_valid(mfn) && v->domain == d )
>> +            /* Short-cut back to paged-in state (but not for foreign
>> +             * mappings, or the pager couldn't map it to page it out) */
>> +            set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K,
>> +                          paging_mode_log_dirty(d)
>> +                          ? p2m_ram_logdirty : p2m_ram_rw, a);
>> +        else
>> +        {
>> +            set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2m_ram_paging_in, a);
>> +            send_request = 1;
>> +        }
>>      }
>>      gfn_unlock(p2m, gfn, 0);
>>
>> -    /* Pause domain if request came from guest and gfn has paging type */
>> -    if ( p2m_is_paging(p2mt) && v->domain == d )
>> +    /* No need to inform pager if the gfn is not in the page-out path */
>> +    if ( !send_request )
>>      {
>> -        vcpu_pause_nosync(v);

I see no sin in stacking vcpu_pauses. Plus, this will need to stay if
we (hopefully) adopt get_gfn_sleep.

>> -        req.flags |= MEM_EVENT_FLAG_VCPU_PAUSED;
>> -    }
>> -    /* No need to inform pager if the gfn is not in the page-out path */
>> -    else if ( p2mt != p2m_ram_paging_out && p2mt != p2m_ram_paged )
>> -    {
>> -        /* gfn is already on its way back and vcpu is not paused */
>>          mem_event_cancel_slot(d, &d->mem_event->paging);
>>          return;
>>      }
>>
>>      /* Send request to pager */
>> +    req.type = MEM_EVENT_TYPE_PAGING;
>>      req.gfn = gfn;
>>      req.p2mt = p2mt;
>>      req.vcpu_id = v->vcpu_id;
>> @@ -1122,6 +1158,7 @@ void p2m_mem_paging_resume(struct domain
>>  {
>>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>      mem_event_response_t rsp;
>> +    struct vcpu *v;
>>      p2m_type_t p2mt;
>>      p2m_access_t a;
>>      mfn_t mfn;
>> @@ -1147,9 +1184,10 @@ void p2m_mem_paging_resume(struct domain
>>          }
>>          gfn_unlock(p2m, rsp.gfn, 0);
>>      }
>> -    /* Unpause domain */
>> -    if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED )
>> -        vcpu_unpause(d->vcpu[rsp.vcpu_id]);

I see no sin in stacking vcpu pauses, redux...

That's all I have. Thanks!
Andres

>> +    /* Wake any vcpus that were waiting for this GFN */
>> +    for_each_vcpu ( d, v )
>> +        if ( v->arch.mem_paging_gfn == rsp.gfn )
>> +            wake_up_all(&v->arch.mem_paging_wq);
>>  }
>>  }
>>
>> diff -r 01c1bcbc6222 xen/include/asm-x86/domain.h
>> --- a/xen/include/asm-x86/domain.h	Thu Feb 16 08:48:23 2012 +0100
>> +++ b/xen/include/asm-x86/domain.h	Thu Feb 16 14:39:13 2012 +0000
>> @@ -4,6 +4,7 @@
>>  #include <xen/config.h>
>>  #include <xen/mm.h>
>>  #include <xen/radix-tree.h>
>> +#include <xen/wait.h>
>>  #include <asm/hvm/vcpu.h>
>>  #include <asm/hvm/domain.h>
>>  #include <asm/e820.h>
>> @@ -491,6 +492,12 @@ struct arch_vcpu
>>
>>      struct paging_vcpu paging;
>>
>> +#ifdef CONFIG_X86_64
>> +    /* Mem-paging: this vcpu is waiting for a gfn to be paged in */
>> +    struct waitqueue_head mem_paging_wq;
>> +    unsigned long mem_paging_gfn;
>> +#endif
>> +
>>  #ifdef CONFIG_X86_32
>>      /* map_domain_page() mapping cache. */
>>      struct mapcache_vcpu mapcache;
>
>
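Andres's suggestion about mem_event_claim_slot() can be sketched as a toy model contrasting the two shapes of p2m_mem_paging_populate(): claim a ring slot up front and cancel it when no request turns out to be needed, versus deciding first and only then claiming. The counters and `toy_*` helpers below are hypothetical stand-ins invented for illustration, not the Xen mem_event API:

```c
#include <assert.h>

static int slots_claimed;
static int slots_cancelled;
static int requests_sent;

/* Stand-ins for mem_event_claim_slot() / mem_event_put_request(). */
static int toy_claim_slot(void) { slots_claimed++; return 0; }
static void toy_send_request(void) { requests_sent++; }

/* Current shape: claim eagerly, cancel if we had nothing to send. */
static void populate_claim_first(int send_request)
{
    if (toy_claim_slot() < 0)
        return;
    if (!send_request) { slots_cancelled++; return; }
    toy_send_request();
}

/* Suggested shape: only claim once we know a request is needed, so the
 * cancel path (and any sleeping inside claim) disappears entirely. */
static void populate_claim_last(int send_request)
{
    if (!send_request)
        return;
    if (toy_claim_slot() < 0)
        return;
    toy_send_request();
}
```

The point of the second shape is not just tidiness: claim_slot can itself block (or push someone else) onto a wait queue when the ring is full, so it is better not to pay that cost for requests that end up cancelled.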
Hi,

At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
> Well, thanks for taking a stab at it. Looks fairly well. My main
> observation is that we do not want to unconditionally go on a wait queue
> everywhere. For example, the grant table code pretty reliably unwinds
> itself and correctly tells the grant mapper to retry, in the presence of
> a paged page.

The grant table code is unlikely to hit a paged-out gfn in the caller's
address space.

> The same could be said of emulation (X86_EMUL_RETRY).

Fair enough. But the emulator uses hvm_copy() in its callbacks anyway, so
we'd need another flag to say 'special handling for p2m fixups plz' in
all the hvm_copy() interfaces. Unpleasant!

> And certainly the balloon code should not sleep waiting for populate!

The balloon code should not be trying to populate at all!

> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
> wait queue code is invoked only there. And then we can call
> get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
> __hvm_copy.

I see. In the longer term I definitely do not want get_gfn()'s callers
to have to deal with p2m_is_paged(), and PoD, and sharing, and every
other p2m hack we come up with. I want all that stuff hidden behind
get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.

I don't like get_gfn_sleep() very much but I'll see if I can find a way
to do the equivalent with a flag to get_gfn_*(). But I suspect it will
end up with a tri-state argument of:
 a) just give me the current state;
 b) try (without sleeping) to fix up paging, PoD and sharing but
    still maybe make me handle it myself by error propagation; and
 c) just fix it up.

'b' is a pretty weird interface. Specially given that in corner cases,
'make me handle it' might involve hitting a wait queue anyway. If we
intend in the fullness of time to get rid of it (as I hope we would)
then probably we shouldn't introduce it. (And if that means pushing
this whole change out past 4.2, we should consider doing that.)

> >> diff -r 01c1bcbc6222 xen/arch/x86/mm/p2m.c
> >> --- a/xen/arch/x86/mm/p2m.c	Thu Feb 16 08:48:23 2012 +0100
> >> +++ b/xen/arch/x86/mm/p2m.c	Thu Feb 16 14:39:13 2012 +0000
> >> @@ -160,13 +160,49 @@ mfn_t __get_gfn_type_access(struct p2m_d
> >>      }
> >>
> >>      /* For now only perform locking on hap domains */
> >> -    if ( locked && (hap_enabled(p2m->domain)) )
> >> +    locked = locked && hap_enabled(p2m->domain);
> >> +
> >> +#ifdef __x86_64__
> When are we getting rid of 32 bit hypervisor? (half kidding. Only half)

Can't be soon enough for me. I've been sharpening the axe for that one
for years now. :)

> >> +again:
> >> +#endif
> >> +    if ( locked )
> >>          /* Grab the lock here, don't release until put_gfn */
> >>          gfn_lock(p2m, gfn, 0);
> >>
> >>      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
> >>
> >>  #ifdef __x86_64__
> >> +    if ( p2m_is_paging(*t) && q != p2m_query
> >> +         && p2m->domain == current->domain )
> >> +    {
> >> +        if ( locked )
> >> +            gfn_unlock(p2m, gfn, 0);
> >> +
> >> +        /* Ping the pager */
> >> +        if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
> >> +            p2m_mem_paging_populate(p2m->domain, gfn);
>
> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
> still have the mfn in place.

Doesn't p2m_mem_paging_populate already DTRT in all these cases? If
not, it really should.

> >> +
> >> +        /* Safety catch: it _should_ be safe to wait here but if it's not,
> >> +         * crash the VM, not the host */
> >> +        if ( in_atomic() )
> >> +        {
> >> +            WARN();
> >> +            domain_crash(p2m->domain);
> >> +        }
> >> +        else
> >> +        {
> >> +            current->arch.mem_paging_gfn = gfn;
> >> +            wait_event(current->arch.mem_paging_wq, ({
> >> +                mfn = p2m->get_entry(p2m, gfn, t, a,
> >> +                                     p2m_query, page_order);
> >> +                (*t != p2m_ram_paging_in);
>
> Well, first of all the p2m->get_entry will not happen in a locked
> context, so you will exit the wait loop not holding the p2m lock and
> crash later.

Yes - we always exit the loop without the lock; then we 'goto again' and
do things properly. I realise it's a bit inefficient, but on the
page-in path the p2m lookups shouldn't be the bottleneck. :) I did
write the version that handled the locking in this loop but it got a bit
twisty, plus we need the 'goto again' for other reasons (see below).

As a follow-up I ought to make the page-sharing fixup code that's just
below this use 'goto again' as well; we shouldn't really leave this
function until we get a lookup where all these cases are dealt with.

> So you want __get_gfn_type_access with q = p2m_query, and then put_gfn
> if the condition fails. Probably not gonna fit in a ({}) block ;)
>
> I think checking for (!p2m_is_paging(*t)) is more appropriate.

No, that was deliberate, and IIRC a change from Olaf's patch. If the
page gets paged in and back out while we're asleep, the type goes back
to p2m_ram_paged, and in that case we _must_ break out and retry or this
vcpu might stall forever.

> >>  void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
> >>  {
> >>      struct vcpu *v = current;
> >> -    mem_event_request_t req;
> >> +    mem_event_request_t req = {0};
> >>      p2m_type_t p2mt;
> >>      p2m_access_t a;
> >>      mfn_t mfn;
> >>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
> >> +    int send_request = 0;
> >>
> >>      /* We're paging. There should be a ring */
> >>      int rc = mem_event_claim_slot(d, &d->mem_event->paging);
>
> Given all the optimizations around the send_request flag, it might be
> worth moving the claim_slot call to an if(send_request) block at the
> bottom, and get rid of cancel_slot altogether. Claim_slot could
> potentially go (or cause someone else to go) to a wait queue, and it's
> best to not be that rude.

Sure. It might be tricky to make sure we unwind properly in the failure
case, but I'll have a go at a follow-up patch that looks at that.

(This is not really a regression in this patch AFAICS - in the existing
code the get_gfn() callers end up calling this function anyway.)

> >> -    /* Pause domain if request came from guest and gfn has paging type */
> >> -    if ( p2m_is_paging(p2mt) && v->domain == d )
> >> +    /* No need to inform pager if the gfn is not in the page-out path */
> >> +    if ( !send_request )
> >>      {
> >> -        vcpu_pause_nosync(v);
>
> I see no sin in stacking vcpu_pauses. Plus, this will need to stay if we
> (hopefully) adopt get_gfn_sleep.

Yes, if we do it that way, this pause definitely needs to stay. But if
not then the waitqueue takes care of this case entirely, and this code
is just redundant and confusing.

Cheers,

Tim.
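Tim's point about the wake condition can be captured in a tiny model. The `toy_*` enum below is invented for illustration (the real values are Xen's `p2m_type_t` states); it shows why the condition is "not mid-page-in" rather than "not paging at all":

```c
#include <assert.h>

/* Toy p2m types -- only the states discussed in the thread. */
enum toy_p2mt {
    toy_ram_rw,
    toy_ram_paging_out,
    toy_ram_paged,
    toy_ram_paging_in,
};

/* The patch's wake condition: leave the wait queue whenever the gfn is
 * not in the middle of a page-in.  Crucially this also wakes when the
 * type is toy_ram_paged -- the page came in and was paged straight back
 * out while we slept -- so the retry loop ('goto again') can send a
 * fresh populate request instead of the vcpu stalling forever.  Waking
 * only on "not paging at all" would miss that case. */
static int toy_should_wake(enum toy_p2mt t)
{
    return t != toy_ram_paging_in;
}
```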
Andres Lagar-Cavilla
2012-Feb-17 20:57 UTC
Re: [RFC] [PATCH] x86/mm: use wait queues for mem_paging
> Hi,
>
> At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
>> Well, thanks for taking a stab at it. Looks fairly well. My main
>> observation is that we do not want to unconditionally go on a wait queue
>> everywhere. For example, the grant table code pretty reliably unwinds
>> itself and correctly tells the grant mapper to retry, in the presence of
>> a paged page.
>
> The grant table code is unlikely to hit a paged-out gfn in the caller's
> address space.

Oh yeah they do. At least for the target gfns behind the grants. That's
why we need the retry loops in the PV backends in dom0.

>
>> The same could be said of emulation (X86_EMUL_RETRY).
>
> Fair enough. But the emulator uses hvm_copy() in its callbacks anyway, so
> we'd need another flag to say 'special handling for p2m fixups plz' in
> all the hvm_copy() interfaces. Unpleasant!
>
>> And certainly the balloon code should not sleep waiting for populate!
>
> The balloon code should not be trying to populate at all!

Agreed. get_gfn_query is needed there. Nothing bad happens because we
immediately send drop_page. But it's wrong-ish.

>
>> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
>> wait queue code is invoked only there. And then we can call
>> get_gfn_sleep in vmx_load_pdptrs, in guest page table walks, and in
>> __hvm_copy.
>
> I see. In the longer term I definitely do not want get_gfn()'s callers
> to have to deal with p2m_is_paged(), and PoD, and sharing, and every
> other p2m hack we come up with. I want all that stuff hidden behind
> get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.
>
> I don't like get_gfn_sleep() very much but I'll see if I can find a way
> to do the equivalent with a flag to get_gfn_*(). But I suspect it will
> end up with a tri-state argument of:
>  a) just give me the current state;
>  b) try (without sleeping) to fix up paging, PoD and sharing but
>     still maybe make me handle it myself by error propagation; and
>  c) just fix it up.
>
> 'b' is a pretty weird interface. Specially given that in corner cases,
> 'make me handle it' might involve hitting a wait queue anyway. If we
> intend in the fullness of time to get rid of it (as I hope we would)
> then probably we shouldn't introduce it. (And if that means pushing
> this whole change out past 4.2, we should consider doing that.)
>

It's weird. b) is an evolutionary side-effect. It should disappear. I
guess what I'm arguing for is "it should disappear in the long term, not
on a patch so close to 4.2".

>> >> diff -r 01c1bcbc6222 xen/arch/x86/mm/p2m.c
>> >> --- a/xen/arch/x86/mm/p2m.c	Thu Feb 16 08:48:23 2012 +0100
>> >> +++ b/xen/arch/x86/mm/p2m.c	Thu Feb 16 14:39:13 2012 +0000
>> >> @@ -160,13 +160,49 @@ mfn_t __get_gfn_type_access(struct p2m_d
>> >>      }
>> >>
>> >>      /* For now only perform locking on hap domains */
>> >> -    if ( locked && (hap_enabled(p2m->domain)) )
>> >> +    locked = locked && hap_enabled(p2m->domain);
>> >> +
>> >> +#ifdef __x86_64__
>> When are we getting rid of 32 bit hypervisor? (half kidding. Only half)
>
> Can't be soon enough for me. I've been sharpening the axe for that one
> for years now. :)

What's stopping the death blow?

>
>> >> +again:
>> >> +#endif
>> >> +    if ( locked )
>> >>          /* Grab the lock here, don't release until put_gfn */
>> >>          gfn_lock(p2m, gfn, 0);
>> >>
>> >>      mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
>> >>
>> >>  #ifdef __x86_64__
>> >> +    if ( p2m_is_paging(*t) && q != p2m_query
>> >> +         && p2m->domain == current->domain )
>> >> +    {
>> >> +        if ( locked )
>> >> +            gfn_unlock(p2m, gfn, 0);
>> >> +
>> >> +        /* Ping the pager */
>> >> +        if ( *t == p2m_ram_paging_out || *t == p2m_ram_paged )
>> >> +            p2m_mem_paging_populate(p2m->domain, gfn);
>>
>> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
>> still have the mfn in place.
>
> Doesn't p2m_mem_paging_populate already DTRT in all these cases? If
> not, it really should.
>

It's not about populate, it's about killing the domain unnecessarily in
the in_atomic check.

>> >> +
>> >> +        /* Safety catch: it _should_ be safe to wait here but if it's not,
>> >> +         * crash the VM, not the host */
>> >> +        if ( in_atomic() )
>> >> +        {
>> >> +            WARN();
>> >> +            domain_crash(p2m->domain);
>> >> +        }
>> >> +        else
>> >> +        {
>> >> +            current->arch.mem_paging_gfn = gfn;
>> >> +            wait_event(current->arch.mem_paging_wq, ({
>> >> +                mfn = p2m->get_entry(p2m, gfn, t, a,
>> >> +                                     p2m_query, page_order);
>> >> +                (*t != p2m_ram_paging_in);
>>
>> Well, first of all the p2m->get_entry will not happen in a locked
>> context, so you will exit the wait loop not holding the p2m lock and
>> crash later.
>
> Yes - we always exit the loop without the lock; then we 'goto again' and
> do things properly. I realise it's a bit inefficient, but on the
> page-in path the p2m lookups shouldn't be the bottleneck. :) I did
> write the version that handled the locking in this loop but it got a bit
> twisty, plus we need the 'goto again' for other reasons (see below).

Yes, but you'll be reading the p2m in a racy manner.

>
> As a follow-up I ought to make the page-sharing fixup code that's just
> below this use 'goto again' as well; we shouldn't really leave this
> function until we get a lookup where all these cases are dealt with.
>

I was intending to do that and then I stepped back and ended up with the
current patch submitted a couple of days ago. Same problem, can't safely
go to a wait queue.

I think we won't be able to have a magic
get_gfn_solve_everything_silently() call until we find a solution to the
fact that get_gfn happens in locked contexts. I've thought about a wacky
idea that
1. schedules a sleep-n-retry softirq
2. fails the call
3. avoids crashing the domain on the unwind path
4. when executing softirqs pauses the vcpu as per 1.
5. when woken up retries the last failed hypercall/vmexit/whatever

I shouldn't have said that out loud...

>> So you want __get_gfn_type_access with q = p2m_query, and then put_gfn
>> if the condition fails. Probably not gonna fit in a ({}) block ;)
>>
>> I think checking for (!p2m_is_paging(*t)) is more appropriate.
>
> No, that was deliberate, and IIRC a change from Olaf's patch. If the
> page gets paged in and back out while we're asleep, the type goes back
> to p2m_ram_paged, and in that case we _must_ break out and retry or this
> vcpu might stall forever.

I see, a new populate event is needed.

>
>> >>  void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
>> >>  {
>> >>      struct vcpu *v = current;
>> >> -    mem_event_request_t req;
>> >> +    mem_event_request_t req = {0};
>> >>      p2m_type_t p2mt;
>> >>      p2m_access_t a;
>> >>      mfn_t mfn;
>> >>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> >> +    int send_request = 0;
>> >>
>> >>      /* We're paging. There should be a ring */
>> >>      int rc = mem_event_claim_slot(d, &d->mem_event->paging);
>>
>> Given all the optimizations around the send_request flag, it might be
>> worth moving the claim_slot call to an if(send_request) block at the
>> bottom, and get rid of cancel_slot altogether. Claim_slot could
>> potentially go (or cause someone else to go) to a wait queue, and it's
>> best to not be that rude.
>
> Sure. It might be tricky to make sure we unwind properly in the failure
> case, but I'll have a go at a follow-up patch that looks at that.
>
> (This is not really a regression in this patch AFAICS - in the existing
> code the get_gfn() callers end up calling this function anyway.)

*After* they've put gfn. Not a regression, just cleaner code imo.

Andres

>
>> >> -    /* Pause domain if request came from guest and gfn has paging type */
>> >> -    if ( p2m_is_paging(p2mt) && v->domain == d )
>> >> +    /* No need to inform pager if the gfn is not in the page-out path */
>> >> +    if ( !send_request )
>> >>      {
>> >> -        vcpu_pause_nosync(v);
>>
>> I see no sin in stacking vcpu_pauses. Plus, this will need to stay if we
>> (hopefully) adopt get_gfn_sleep.
>
> Yes, if we do it that way, this pause definitely needs to stay. But if
> not then the waitqueue takes care of this case entirely, and this code
> is just redundant and confusing.
>
> Cheers,
>
> Tim.
>
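The in_atomic() argument above boils down to a three-way decision, which could be modelled like this. This is a hypothetical sketch condensing the point being argued (not Xen code; `toy_fixup` and its enums are invented names): when the type is paging_out the mfn is still present, so an atomic caller could arguably use it instead of the domain being crashed.

```c
#include <assert.h>

enum toy_p2mt {
    toy_ram_rw,
    toy_ram_paging_out,
    toy_ram_paged,
    toy_ram_paging_in,
};

enum toy_action {
    TOY_USE_MFN,       /* mfn still there: no need to sleep at all */
    TOY_WAIT,          /* sleep on the vcpu's waitq until page-in */
    TOY_CRASH_DOMAIN,  /* can't safely wait: crash the VM, not the host */
};

/* Refinement discussed in the thread: check mfn validity before ever
 * reaching the in_atomic() safety catch, so a p2m_ram_paging_out gfn
 * (whose mfn is still in place) doesn't get the domain killed. */
static enum toy_action toy_fixup(enum toy_p2mt t, int mfn_is_valid,
                                 int in_atomic)
{
    if (t == toy_ram_paging_out && mfn_is_valid)
        return TOY_USE_MFN;
    if (in_atomic)
        return TOY_CRASH_DOMAIN;
    return TOY_WAIT;
}
```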
Hi,

I'm about to repost this patch, with a few other changes and fixups.
Further comments below.

At 12:57 -0800 on 17 Feb (1329483430), Andres Lagar-Cavilla wrote:
> > At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
> >> Well, thanks for taking a stab at it. Looks fairly well. My main
> >> observation is that we do not want to unconditionally go on a wait queue
> >> everywhere. For example, the grant table code pretty reliably unwinds
> >> itself and correctly tells the grant mapper to retry, in the presence of
> >> a paged page.
> >
> > The grant table code is unlikely to hit a paged-out gfn in the caller's
> > address space.
>
> Oh yeah they do. At least for the target gfns behind the grants. That's
> why we need the retry loops in the PV backends in dom0.

The target gfns aren't in the caller's p2m, though.

> > The balloon code should not be trying to populate at all!
>
> Agreed. get_gfn_query is needed there. Nothing bad happens because we
> immediately send drop_page. But it's wrong-ish.

Fixed.

> >> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
> >> wait queue code is invoked only there. And then we can call get_gfn_sleep
> >> in vmx_load_pdptrs, in guest page table walks, and in __hvm_copy.
> >
> > I see. In the longer term I definitely do not want get_gfn()'s callers
> > to have to deal with p2m_is_paged(), and PoD, and sharing, and every
> > other p2m hack we come up with. I want all that stuff hidden behind
> > get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.
> >
> > I don't like get_gfn_sleep() very much but I'll see if I can find a way
> > to do the equivalent with a flag to get_gfn_*(). But I suspect it will
> > end up with a tri-state argument of:
> > a) just give me the current state;
> > b) try (without sleeping) to fix up paging, PoD and sharing but
> >    still maybe make me handle it myself by error propagation; and
> > c) just fix it up.
> >
> > 'b' is a pretty weird interface. Especially given that in corner cases,
> > 'make me handle it' might involve hitting a wait queue anyway. If we
> > intend in the fullness of time to get rid of it (as I hope we would)
> > then probably we shouldn't introduce it. (And if that means pushing
> > this whole change out past 4.2, we should consider doing that.)
>
> It's weird. b) is an evolutionary side-effect. It should disappear. I
> guess what I'm arguing for is "it should disappear in the long term, not
> on a patch so close to 4.2".

Well, I tried making the 'but don't go to a waitq' flag; the
implementation is trivial on top of the patches I'm about to post, but
the question of when to set it was so ugly I gave up.

I have hit another stumbling block though - get_two_gfns() can't safely
ask get_gfn to populate or unshare if that might involve a waitq, since
by definition one of its two calls is made with a lock held. I can't
see a nice way of having it retry (not one that's guaranteed to make
progress) without pulling the fixup-and-retry up into that function. I
might do that in the next revision.

> > Can't be soon enough for me. I've been sharpening the axe for that one
> > for years now. :)
>
> What's stopping the death blow?

People said they still need it.

> >> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
> >> still have the mfn in place.
> >
> > Doesn't p2m_mem_paging_populate already DTRT in all these cases? If
> > not, it really should.
>
> It's not about populate, it's about killing the domain unnecessarily in
> the in_atomic check.

Oh I see, thanks. Fixed.

> >> >> +
> >> >> +    /* Safety catch: it _should_ be safe to wait here but if it's not,
> >> >> +     * crash the VM, not the host */
> >> >> +    if ( in_atomic() )
> >> >> +    {
> >> >> +        WARN();
> >> >> +        domain_crash(p2m->domain);
> >> >> +    }
> >> >> +    else
> >> >> +    {
> >> >> +        current->arch.mem_paging_gfn = gfn;
> >> >> +        wait_event(current->arch.mem_paging_wq, ({
> >> >> +            mfn = p2m->get_entry(p2m, gfn, t, a,
> >> >> +                                 p2m_query, page_order);
> >> >> +            (*t != p2m_ram_paging_in);
> >>
> >> Well, first of all mfn->get_entry will not happen in a locked context, so
> >> you will exit the wait loop not holding the p2m lock and crash later.
> >
> > Yes - we always exit the loop without the lock; then we 'goto again' and
> > do things properly. I realise it's a bit inefficient, but on the
> > page-in path the p2m lookups shouldn't be the bottleneck. :) I did
> > write the version that handled the locking in this loop but it got a bit
> > twisty, plus we need the 'goto again' for other reasons (see below).
>
> Yes, but, you'll be reading the p2m in a racy manner.

I don't think there are any real races here. Either we see
p2m_ram_paging_in (and someone will eventually wake us when it pages in)
or we don't (and we go back and do the lookup safely).

> > As a follow-up I ought to make the page-sharing fixup code that's just
> > below this use 'goto again' as well; we shouldn't really leave this
> > function until we get a lookup where all these cases are dealt with.

On closer inspection the page-sharing version is OK because it doesn't
release the lock, and the unshare function won't leave us with a
paged-out page.

> I think we won't be able to have a magic
> get_gfn_solve_everything_silently() call until we find a solution to the
> fact that get_gfn happens in locked contexts.

Yes, I agree, and I'm coming round to the opinion that we might be too
late to fix that in 4.2. :|

> I've thought about a wacky idea that
> 1. schedules a sleep-n-retry softirq
> 2. fails the call
> 3. avoids crashing the domain on the unwind path
> 4. when executing softirqs pauses the vcpu as per 1.
> 5. When woken up retries the last failed hypercall/vmexit/whatever
>
> I shouldn't have said that out loud...

The first time I approached the idea of paging guest RAM (before Patrick
started on the current system) I prototyped something not a million
miles from this (basically, a sort of setjmp() on hypervisor entry) but
the question of making all hypercalls and VMEXITs idempotent scared me
off. :)

> >> >> void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
> >> >> {
> >> >>     struct vcpu *v = current;
> >> >> -   mem_event_request_t req;
> >> >> +   mem_event_request_t req = {0};
> >> >>     p2m_type_t p2mt;
> >> >>     p2m_access_t a;
> >> >>     mfn_t mfn;
> >> >>     struct p2m_domain *p2m = p2m_get_hostp2m(d);
> >> >> +   int send_request = 0;
> >> >>
> >> >>     /* We're paging. There should be a ring */
> >> >>     int rc = mem_event_claim_slot(d, &d->mem_event->paging);
> >>
> >> Given all the optimizations around the send_request flag, it might be
> >> worth moving the claim_slot call to an if(send_request) block at the
> >> bottom, and get rid of cancel_slot altogether. Claim_slot could
> >> potentially go (or cause someone else to go) to a wait queue, and it's
> >> best not to be that rude.
> >
> > Sure. It might be tricky to make sure we unwind properly in the failure
> > case, but I'll have a go at a follow-up patch that looks at that.
> >
> > (This is not really a regression in this patch AFAICS - in the existing
> > code the get_gfn() callers end up calling this function anyway.)
>
> *After* they've put gfn. Not a regression, just cleaner code imo.

This is also called after putting the gfn - but in any case I've
reshuffled the mem_event_claim_slot in the new version.

Cheers,

Tim.
Andres Lagar-Cavilla
2012-Feb-23 15:55 UTC
Re: [RFC] [PATCH] x86/mm: use wait queues for mem_paging
> Hi,
>
> I'm about to repost this patch, with a few other changes and fixups.
> Further comments below.
>
> At 12:57 -0800 on 17 Feb (1329483430), Andres Lagar-Cavilla wrote:
>> > At 08:05 -0800 on 17 Feb (1329465959), Andres Lagar-Cavilla wrote:
>> >> Well, thanks for taking a stab at it. Looks fairly well. My main
>> >> observation is that we do not want to unconditionally go on a wait queue
>> >> everywhere. For example, the grant table code pretty reliably unwinds
>> >> itself and correctly tells the grant mapper to retry, in the presence of
>> >> a paged page.
>> >
>> > The grant table code is unlikely to hit a paged-out gfn in the caller's
>> > address space.
>>
>> Oh yeah they do. At least for the target gfns behind the grants. That's
>> why we need the retry loops in the PV backends in dom0.
>
> The target gfns aren't in the caller's p2m, though.
>
>> > The balloon code should not be trying to populate at all!
>>
>> Agreed. get_gfn_query is needed there. Nothing bad happens because we
>> immediately send drop_page. But it's wrong-ish.
>
> Fixed.
>
>> >> Hence I had proposed introducing get_gfn_sleep, and I'd be happy if the
>> >> wait queue code is invoked only there. And then we can call get_gfn_sleep
>> >> in vmx_load_pdptrs, in guest page table walks, and in __hvm_copy.
>> >
>> > I see. In the longer term I definitely do not want get_gfn()'s callers
>> > to have to deal with p2m_is_paged(), and PoD, and sharing, and every
>> > other p2m hack we come up with. I want all that stuff hidden behind
>> > get_gfn_*(), with at most a flag to say 'fix up magic MM tricks'.
>> >
>> > I don't like get_gfn_sleep() very much but I'll see if I can find a way
>> > to do the equivalent with a flag to get_gfn_*(). But I suspect it will
>> > end up with a tri-state argument of:
>> > a) just give me the current state;
>> > b) try (without sleeping) to fix up paging, PoD and sharing but
>> >    still maybe make me handle it myself by error propagation; and
>> > c) just fix it up.
>> >
>> > 'b' is a pretty weird interface. Especially given that in corner cases,
>> > 'make me handle it' might involve hitting a wait queue anyway. If we
>> > intend in the fullness of time to get rid of it (as I hope we would)
>> > then probably we shouldn't introduce it. (And if that means pushing
>> > this whole change out past 4.2, we should consider doing that.)
>>
>> It's weird. b) is an evolutionary side-effect. It should disappear. I
>> guess what I'm arguing for is "it should disappear in the long term, not
>> on a patch so close to 4.2"
>
> Well, I tried making the 'but don't go to a waitq' flag; the
> implementation is trivial on top of the patches I'm about to post, but
> the question of when to set it was so ugly I gave up.
>
> I have hit another stumbling block though - get_two_gfns() can't safely
> ask get_gfn to populate or unshare if that might involve a waitq, since
> by definition one of its two calls is made with a lock held. I can't
> see a nice way of having it retry (not one that's guaranteed to make
> progress) without pulling the fixup-and-retry up into that function. I
> might do that in the next revision.

One way to do this is use query_unlocked on the second. If we sense
danger, drop the lock on the first, then do the right thing for the
second, then drop the lock for the second, then retry at the top.
Front-runner for ugliest-thing-in-the-hypervisor award.

hvm_task_switch also gets two gfns, although in a manner that prevents
easily using get_two_gfns (hvm_map_entry on gdt offsets). It's unlikely
this will ever be a problem (it's somewhat reasonable to expect both gdt
entries to fall in the same gfn) but it's something worth keeping in mind.

>> > Can't be soon enough for me. I've been sharpening the axe for that one
>> > for years now. :)
>>
>> What's stopping the death blow?
>
> People said they still need it.
>
>> >> You want to make sure the mfn is not valid. For p2m_ram_paging_out we
>> >> still have the mfn in place.
>> >
>> > Doesn't p2m_mem_paging_populate already DTRT in all these cases? If
>> > not, it really should.
>>
>> It's not about populate, it's about killing the domain unnecessarily in
>> the in_atomic check.
>
> Oh I see, thanks. Fixed.
>
>> >> >> +
>> >> >> +    /* Safety catch: it _should_ be safe to wait here but if it's not,
>> >> >> +     * crash the VM, not the host */
>> >> >> +    if ( in_atomic() )
>> >> >> +    {
>> >> >> +        WARN();
>> >> >> +        domain_crash(p2m->domain);
>> >> >> +    }
>> >> >> +    else
>> >> >> +    {
>> >> >> +        current->arch.mem_paging_gfn = gfn;
>> >> >> +        wait_event(current->arch.mem_paging_wq, ({
>> >> >> +            mfn = p2m->get_entry(p2m, gfn, t, a,
>> >> >> +                                 p2m_query, page_order);
>> >> >> +            (*t != p2m_ram_paging_in);
>> >>
>> >> Well, first of all mfn->get_entry will not happen in a locked context, so
>> >> you will exit the wait loop not holding the p2m lock and crash later.
>> >
>> > Yes - we always exit the loop without the lock; then we 'goto again' and
>> > do things properly. I realise it's a bit inefficient, but on the
>> > page-in path the p2m lookups shouldn't be the bottleneck. :) I did
>> > write the version that handled the locking in this loop but it got a bit
>> > twisty, plus we need the 'goto again' for other reasons (see below).
>>
>> Yes, but, you'll be reading the p2m in a racy manner.
>
> I don't think there are any real races here. Either we see
> p2m_ram_paging_in (and someone will eventually wake us when it pages in)
> or we don't (and we go back and do the lookup safely).
>
>> > As a follow-up I ought to make the page-sharing fixup code that's just
>> > below this use 'goto again' as well; we shouldn't really leave this
>> > function until we get a lookup where all these cases are dealt with.
>
> On closer inspection the page-sharing version is OK because it doesn't
> release the lock, and the unshare function won't leave us with a
> paged-out page.
>
>> I think we won't be able to have a magic
>> get_gfn_solve_everything_silently() call until we find a solution to the
>> fact that get_gfn happens in locked contexts.
>
> Yes, I agree, and I'm coming round to the opinion that we might be too
> late to fix that in 4.2. :|
>
>> I've thought about a wacky idea that
>> 1. schedules a sleep-n-retry softirq
>> 2. fails the call
>> 3. avoids crashing the domain on the unwind path
>> 4. when executing softirqs pauses the vcpu as per 1.
>> 5. When woken up retries the last failed hypercall/vmexit/whatever
>>
>> I shouldn't have said that out loud...
>
> The first time I approached the idea of paging guest RAM (before Patrick
> started on the current system) I prototyped something not a million
> miles from this (basically, a sort of setjmp() on hypervisor entry) but
> the question of making all hypercalls and VMEXITs idempotent scared me
> off. :)

We've used long jump-like stuff. It's fragile, but it carried us through
the day. It's a non-starter here, I hope.

Looking forward to the new patch.

Andres

>> >> >> void p2m_mem_paging_populate(struct domain *d, unsigned long gfn)
>> >> >> {
>> >> >>     struct vcpu *v = current;
>> >> >> -   mem_event_request_t req;
>> >> >> +   mem_event_request_t req = {0};
>> >> >>     p2m_type_t p2mt;
>> >> >>     p2m_access_t a;
>> >> >>     mfn_t mfn;
>> >> >>     struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> >> >> +   int send_request = 0;
>> >> >>
>> >> >>     /* We're paging. There should be a ring */
>> >> >>     int rc = mem_event_claim_slot(d, &d->mem_event->paging);
>> >>
>> >> Given all the optimizations around the send_request flag, it might be
>> >> worth moving the claim_slot call to an if(send_request) block at the
>> >> bottom, and get rid of cancel_slot altogether. Claim_slot could
>> >> potentially go (or cause someone else to go) to a wait queue, and it's
>> >> best not to be that rude.
>> >
>> > Sure. It might be tricky to make sure we unwind properly in the failure
>> > case, but I'll have a go at a follow-up patch that looks at that.
>> >
>> > (This is not really a regression in this patch AFAICS - in the existing
>> > code the get_gfn() callers end up calling this function anyway.)
>>
>> *After* they've put gfn. Not a regression, just cleaner code imo.
>
> This is also called after putting the gfn - but in any case I've
> reshuffled the mem_event_claim_slot in the new version.
>
> Cheers,
>
> Tim.
At 07:55 -0800 on 23 Feb (1329983736), Andres Lagar-Cavilla wrote:
> > I have hit another stumbling block though - get_two_gfns() can't safely
> > ask get_gfn to populate or unshare if that might involve a waitq, since
> > by definition one of its two calls is made with a lock held. I can't
> > see a nice way of having it retry (not one that's guaranteed to make
> > progress) without pulling the fixup-and-retry up into that function. I
> > might do that in the next revision.
>
> One way to do this is use query_unlocked on the second. If we sense
> danger, drop the lock on the first, then do the right thing for the
> second, then drop the lock for the second, then retry at the top.

Sadly, this isn't guaranteed to make progress in the face of an
aggressive pager or sharer. It might be OK with yet another waitq
invocation, though...

> hvm_task_switch also gets two gfns, although in a manner that prevents
> easily using get_two_gfns (hvm_map_entry on gdt offsets). It's unlikely
> this will ever be a problem (it's somewhat reasonable to expect both gdt
> entries to fall in the same gfn) but it's something worth keeping in mind.

Thanks, I'll add it to the list.

Tim.