Cui, Dexuan
2008-Oct-07 10:08 UTC
[Xen-devel] A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
For an SMP Linux HVM guest with PV drivers inserted, when we do save/restore (or
LiveMigration) for the guest, it might panic after it''s restored.
The panic point is inside ap_suspend():
....
while (info->do_spin) {
cpu_relax();
read_lock(&suspend_lock);
HYPERVISOR_yield(); ----> guest might panic on the invocation of
this function.
read_unlock(&suspend_lock);
}
...
The root cause is: ap might be invoking the hypercall while bsp is asking the
hypervisor to re-initialize the hypercall page when the guest has been just
restored!
What''s the purpose of re-initializing the hypercall page here? To
improve the compatibility in the case the src/target hosts have different
hypercall stub codes?
PS, I''m using c/s 18353 to debug the issue. (the latest
xen-unstable.hg''s S/R and L/M are broken by 18383.
Thanks,
-- Dexuan
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2008-Oct-07 10:16 UTC
[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
Hi Dexuan, Are you really sure that this is the problem? The suspend_lock was introduced specifically to solve this problem. Note that the BSP takes this lock before messing with the hypercall page. -- Keir On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui@intel.com> wrote:> For an SMP Linux HVM guest with PV drivers inserted, when we do save/restore > (or LiveMigration) for the guest, it might panic after it''s restored. > The panic point is inside ap_suspend(): > .... > while (info->do_spin) { > cpu_relax(); > read_lock(&suspend_lock); > HYPERVISOR_yield(); ----> guest might panic on the invocation of > this function. > read_unlock(&suspend_lock); > } > ... > > The root cause is: ap might be invoking the hypercall while bsp is asking the > hypervisor to re-initialize the hypercall page when the guest has been just > restored! > > What''s the purpose of re-initializing the hypercall page here? To improve the > compatibility in the case the src/target hosts have different hypercall stub > codes? > > PS, I''m using c/s 18353 to debug the issue. (the latest xen-unstable.hg''s S/R > and L/M are broken by 18383. > > Thanks, > -- Dexuan_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Cui, Dexuan
2008-Oct-07 10:50 UTC
[Xen-devel] RE: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
Sorry, I neglected the lock... However, guest does panic at that point. Let me dig more. Thanks, -- Dexuan -----Original Message----- From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] Sent: 2008年10月7日 18:16 To: Cui, Dexuan; 'xen-devel@lists.xensource.com' Subject: Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore Hi Dexuan, Are you really sure that this is the problem? The suspend_lock was introduced specifically to solve this problem. Note that the BSP takes this lock before messing with the hypercall page. -- Keir On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui@intel.com> wrote:> For an SMP Linux HVM guest with PV drivers inserted, when we do save/restore > (or LiveMigration) for the guest, it might panic after it's restored. > The panic point is inside ap_suspend(): > .... > while (info->do_spin) { > cpu_relax(); > read_lock(&suspend_lock); > HYPERVISOR_yield(); ----> guest might panic on the invocation of > this function. > read_unlock(&suspend_lock); > } > ... > > The root cause is: ap might be invoking the hypercall while bsp is asking the > hypervisor to re-initialize the hypercall page when the guest has been just > restored! > > What's the purpose of re-initializing the hypercall page here? To improve the > compatibility in the case the src/target hosts have different hypercall stub > codes? > > PS, I'm using c/s 18353 to debug the issue. (the latest xen-unstable.hg's S/R > and L/M are broken by 18383. > > Thanks, > -- Dexuan_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Mukesh
2008-Oct-21 03:38 UTC
[Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
Keir Fraser <keir.fraser <at> eu.citrix.com> writes:> > Hi Dexuan, > > Are you really sure that this is the problem? The suspend_lock was > introduced specifically to solve this problem. Note that the BSP takes this > lock before messing with the hypercall page. > > -- KeirI''m also looking at this now (I''m on 3.1.4 BTW). I see both hang and panic. it appears I see the hang because the "master" vcpu is trying to catch other vcpus right at the cpu_relax so it can grab the lock in write mode. With many VCPUs it''s just not happening..... Not sure i like the design of this very much... i''m gonna try to modify it a bit .... thanks mukesh> On 7/10/08 11:08, "Cui, Dexuan" <dexuan.cui <at> intel.com> wrote: > > > For an SMP Linux HVM guest with PV drivers inserted, when we do save/restore > > (or LiveMigration) for the guest, it might panic after it''s restored. > > The panic point is inside ap_suspend(): > > .... > > while (info->do_spin) { > > cpu_relax(); > > read_lock(&suspend_lock); > > HYPERVISOR_yield(); ----> guest might panic on the invocation of > > this function. > > read_unlock(&suspend_lock); > > } > > ... > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Oct-21 07:05 UTC
Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com> wrote:>> Hi Dexuan, >> >> Are you really sure that this is the problem? The suspend_lock was >> introduced specifically to solve this problem. Note that the BSP takes this >> lock before messing with the hypercall page. >> >> -- Keir > > I''m also looking at this now (I''m on 3.1.4 BTW). I see both hang and panic. it > appears I see the hang because the "master" vcpu is trying to catch other > vcpus > right at the cpu_relax so it can grab the lock in write mode. With many VCPUs > it''s just not happening..... Not sure i like the design of this very much... > i''m > gonna try to modify it a bit ....Hmmm, yes, agreed. We need a completely different approach, or a lock which doesn''t allow in more readers when there is a waiting writer. I''ll look into it. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Mukesh Rathor
2008-Oct-22 02:16 UTC
Re: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
Keir Fraser wrote:
> On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com>
wrote:
>
>>> Hi Dexuan,
>>>
>>> Are you really sure that this is the problem? The suspend_lock was
>>> introduced specifically to solve this problem. Note that the BSP
takes this
>>> lock before messing with the hypercall page.
>>>
>>> -- Keir
>> I''m also looking at this now (I''m on 3.1.4 BTW). I
see both hang and panic. it
>> appears I see the hang because the "master" vcpu is trying
to catch other
>> vcpus
>> right at the cpu_relax so it can grab the lock in write mode. With
many VCPUs
>> it''s just not happening..... Not sure i like the design of
this very much...
>> i''m
>> gonna try to modify it a bit ....
>
> Hmmm, yes, agreed. We need a completely different approach, or a lock
which
> doesn''t allow in more readers when there is a waiting writer.
I''ll look into
> it.
>
> -- Keir
>
For now I fixed by removing yield and the lock at the expense of
burning a little cpu. Please let me know if you see any problems
with it...
ap_suspend()
..
while (info->do_spin) {
cpu_relax();
}
..
bp_suspend():
..
if (!suspend_cancelled) {
platform_pci_resume();
gnttab_resume();
irq_resume();
}
...
thanks,
Mukesh
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Cui, Dexuan
2008-Oct-22 02:26 UTC
RE: [Xen-devel] Re: A race condition introduced by changeset 15175: Re-init hypercall stubs page after HVM save/restore
Hi Mukesh,
Keir has applied the same idea: please see changeset 18669: 459f7ca6cf2a. :-)
Thanks,
-- Dexuan
-----Original Message-----
From: Mukesh Rathor [mailto:mukesh.rathor@oracle.com]
Sent: 2008年10月22日 10:16
To: Keir Fraser
Cc: xen-devel@lists.xensource.com; Cui, Dexuan
Subject: Re: [Xen-devel] Re: A race condition introduced by changeset 15175:
Re-init hypercall stubs page after HVM save/restore
Keir Fraser wrote:
> On 21/10/08 04:38, "Mukesh" <mukesh.rathor@oracle.com>
wrote:
>
>>> Hi Dexuan,
>>>
>>> Are you really sure that this is the problem? The suspend_lock was
>>> introduced specifically to solve this problem. Note that the BSP
takes this
>>> lock before messing with the hypercall page.
>>>
>>> -- Keir
>> I'm also looking at this now (I'm on 3.1.4 BTW). I see both
hang and panic. it
>> appears I see the hang because the "master" vcpu is trying
to catch other
>> vcpus
>> right at the cpu_relax so it can grab the lock in write mode. With
many VCPUs
>> it's just not happening..... Not sure i like the design of this
very much...
>> i'm
>> gonna try to modify it a bit ....
>
> Hmmm, yes, agreed. We need a completely different approach, or a lock
which
> doesn't allow in more readers when there is a waiting writer. I'll
look into
> it.
>
> -- Keir
>
For now I fixed by removing yield and the lock at the expense of
burning a little cpu. Please let me know if you see any problems
with it...
ap_suspend()
..
while (info->do_spin) {
cpu_relax();
}
..
bp_suspend():
..
if (!suspend_cancelled) {
platform_pci_resume();
gnttab_resume();
irq_resume();
}
...
thanks,
Mukesh
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel