thr3ads.net - Xen devel - [Xen-devel] A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore [Oct 2008]

If this information is useful, please help other people find it:
Share via:

Cui, Dexuan

2008-Oct-07 10:08 UTC

[Xen-devel] A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

For an SMP Linux HVM guest with PV drivers inserted, when we do save/restore (or
LiveMigration) for the guest, it might panic after it''s restored.
The panic point is inside ap_suspend():
 ....
    while (info->do_spin) {
        cpu_relax();
        read_lock(&suspend_lock);
        HYPERVISOR_yield();      ----> guest might panic on the invocation of
this function.
        read_unlock(&suspend_lock);
    }
...

The root cause is: ap might be invoking the hypercall while bsp is asking the
hypervisor to re-initialize the hypercall page when the guest has been just
restored!

What''s the purpose of re-initializing the hypercall page here? To
improve the compatibility in the case the src/target hosts have different
hypercall stub codes?

PS, I''m using c/s 18353 to debug the issue. (the latest
xen-unstable.hg''s S/R and L/M are broken by 18383.

Thanks,
-- Dexuan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2008-Oct-07 10:16 UTC

head link

[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Hi Dexuan,

Are you really sure that this is the problem? The suspend_lock was
introduced specifically to solve this problem. Note that the BSP takes this
lock before messing with the hypercall page.

 -- Keir

On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui@intel.com> wrote:
> For an SMP Linux HVM guest with PV drivers inserted, when we do
save/restore
> (or LiveMigration) for the guest, it might panic after it''s
restored.
> The panic point is inside ap_suspend():
>  ....
>     while (info->do_spin) {
>         cpu_relax();
>         read_lock(&suspend_lock);
>         HYPERVISOR_yield();      ----> guest might panic on the
invocation of
> this function.
>         read_unlock(&suspend_lock);
>     }
> ...
> 
> The root cause is: ap might be invoking the hypercall while bsp is asking
the
> hypervisor to re-initialize the hypercall page when the guest has been just
> restored!
> 
> What''s the purpose of re-initializing the hypercall page here? To
improve the
> compatibility in the case the src/target hosts have different hypercall
stub
> codes?
> 
> PS, I''m using c/s 18353 to debug the issue. (the latest
xen-unstable.hg''s S/R
> and L/M are broken by 18383.
> 
> Thanks,
> -- Dexuan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Cui, Dexuan

2008-Oct-07 10:50 UTC

head link

[Xen-devel] RE: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Sorry, I neglected the lock...
However, guest does panic at that point. Let me dig more.

Thanks,
-- Dexuan


-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
Sent: 2008年10月7日 18:16
To: Cui, Dexuan; 'xen-devel@lists.xensource.com'
Subject: Re: A race condition introduced by changeset 15175: Re-init hypercall
stubs page after HVM save/restore

Hi Dexuan,

Are you really sure that this is the problem? The suspend_lock was
introduced specifically to solve this problem. Note that the BSP takes this
lock before messing with the hypercall page.

 -- Keir

On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui@intel.com> wrote:
> For an SMP Linux HVM guest with PV drivers inserted, when we do
save/restore
> (or LiveMigration) for the guest, it might panic after it's restored.
> The panic point is inside ap_suspend():
>  ....
>     while (info->do_spin) {
>         cpu_relax();
>         read_lock(&suspend_lock);
>         HYPERVISOR_yield();      ----> guest might panic on the
invocation of
> this function.
>         read_unlock(&suspend_lock);
>     }
> ...
>
> The root cause is: ap might be invoking the hypercall while bsp is asking
the
> hypervisor to re-initialize the hypercall page when the guest has been just
> restored!
>
> What's the purpose of re-initializing the hypercall page here? To
improve the
> compatibility in the case the src/target hosts have different hypercall
stub
> codes?
>
> PS, I'm using c/s 18353 to debug the issue. (the latest
xen-unstable.hg's S/R
> and L/M are broken by 18383.
>
> Thanks,
> -- Dexuan
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Mukesh

2008-Oct-21 03:38 UTC

head link

[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Keir Fraser <keir.fraser <at> eu.citrix.com> writes:
> 
> Hi Dexuan,
> 
> Are you really sure that this is the problem? The suspend_lock was
> introduced specifically to solve this problem. Note that the BSP takes this
> lock before messing with the hypercall page.
> 
>  -- Keir
I''m also looking at this now (I''m on 3.1.4 BTW). I see both
hang and panic. it
appears I see the hang because the "master" vcpu is trying to catch
other vcpus
right at the cpu_relax so it can grab the lock in write mode. With many VCPUs
it''s just not happening..... Not sure i like the design of this very
much... i''m
gonna try to modify it a bit .... 

thanks
mukesh



> On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui <at>
intel.com> wrote:
> 
> > For an SMP Linux HVM guest with PV drivers inserted, when we do
save/restore
> > (or LiveMigration) for the guest, it might panic after it''s
restored.
> > The panic point is inside ap_suspend():
> >  ....
> >     while (info->do_spin) {
> >         cpu_relax();
> >         read_lock(&suspend_lock);
> >         HYPERVISOR_yield();      ----> guest might panic on the
invocation of
> > this function.
> >         read_unlock(&suspend_lock);
> >     }
> > ...
> > 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2008-Oct-21 07:05 UTC

head link

Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com> wrote:
>> Hi Dexuan,
>> 
>> Are you really sure that this is the problem? The suspend_lock was
>> introduced specifically to solve this problem. Note that the BSP takes
this
>> lock before messing with the hypercall page.
>> 
>>  -- Keir
> 
> I''m also looking at this now (I''m on 3.1.4 BTW). I see
both hang and panic. it
> appears I see the hang because the "master" vcpu is trying to
catch other
> vcpus
> right at the cpu_relax so it can grab the lock in write mode. With many
VCPUs
> it''s just not happening..... Not sure i like the design of this
very much...
> i''m
> gonna try to modify it a bit ....
Hmmm, yes, agreed. We need a completely different approach, or a lock which
doesn''t allow in more readers when there is a waiting writer.
I''ll look into
it.

 -- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Mukesh Rathor

2008-Oct-22 02:16 UTC

head link

Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Keir Fraser wrote:
 > On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com>
wrote:
 >
 >>> Hi Dexuan,
 >>>
 >>> Are you really sure that this is the problem? The suspend_lock was
 >>> introduced specifically to solve this problem. Note that the BSP
takes this
 >>> lock before messing with the hypercall page.
 >>>
 >>>  -- Keir
 >> I''m also looking at this now (I''m on 3.1.4 BTW). I
see both hang and panic. it
 >> appears I see the hang because the "master" vcpu is trying
to catch other
 >> vcpus
 >> right at the cpu_relax so it can grab the lock in write mode. With
many VCPUs
 >> it''s just not happening..... Not sure i like the design of
this very much...
 >> i''m
 >> gonna try to modify it a bit ....
 >
 > Hmmm, yes, agreed. We need a completely different approach, or a lock
which
 > doesn''t allow in more readers when there is a waiting writer.
I''ll look into
 > it.
 >
 >  -- Keir
 >

For now I fixed by removing yield and the lock at the expense of
burning a little cpu. Please let me know if you see any problems
with it...

ap_suspend()
..
         while (info->do_spin) {
                 cpu_relax();
         }
..

bp_suspend():
..
         if (!suspend_cancelled) {
	    platform_pci_resume();
	    gnttab_resume();
	    irq_resume();
	}
...

thanks,
Mukesh


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Cui, Dexuan

2008-Oct-22 02:26 UTC

head link

RE: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Hi Mukesh,
Keir has applied the same idea: please see changeset 18669: 459f7ca6cf2a. :-)

Thanks,
-- Dexuan

-----Original Message-----
From: Mukesh Rathor [mailto:mukesh.rathor@oracle.com]
Sent: 2008年10月22日 10:16
To: Keir Fraser
Cc: xen-devel@lists.xensource.com; Cui, Dexuan
Subject: Re: [Xen-devel] Re: A race condition introduced by changeset 15175:
Re-init hypercall stubs page after HVM save/restore



Keir Fraser wrote:
 > On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com>
wrote:
 >
 >>> Hi Dexuan,
 >>>
 >>> Are you really sure that this is the problem? The suspend_lock was
 >>> introduced specifically to solve this problem. Note that the BSP
takes this
 >>> lock before messing with the hypercall page.
 >>>
 >>>  -- Keir
 >> I'm also looking at this now (I'm on 3.1.4 BTW). I see both
hang and panic. it
 >> appears I see the hang because the "master" vcpu is trying
to catch other
 >> vcpus
 >> right at the cpu_relax so it can grab the lock in write mode. With
many VCPUs
 >> it's just not happening..... Not sure i like the design of this
very much...
 >> i'm
 >> gonna try to modify it a bit ....
 >
 > Hmmm, yes, agreed. We need a completely different approach, or a lock
which
 > doesn't allow in more readers when there is a waiting writer. I'll
look into
 > it.
 >
 >  -- Keir
 >

For now I fixed by removing yield and the lock at the expense of
burning a little cpu. Please let me know if you see any problems
with it...

ap_suspend()
..
         while (info->do_spin) {
                 cpu_relax();
         }
..

bp_suspend():
..
         if (!suspend_cancelled) {
            platform_pci_resume();
            gnttab_resume();
            irq_resume();
        }
...

thanks,
Mukesh

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Reasonably Related Threads

Search for more apparently analagous threads

Xen devel - Oct 2008 - A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

[Xen-devel] A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

[Xen-devel] RE: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

RE: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore

Reasonably Related Threads