Andres Lagar-Cavilla
2012-Feb-01 20:49 UTC
Domain relinquish resources racing with p2m access
So we've run into this interesting (race?) condition while doing stress-testing. We pummel the domain with paging, sharing and mmap operations from dom0, and concurrently we launch a domain destruction. Often we get in the logs something along these lines:

(XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1

We're using the synchronized p2m patches just posted, so my analysis is as follows:

- The domain destroy domctl kicks in. It calls relinquish resources. This disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p entries.

- In parallel, a do_mmu_update is making progress; it has no issues performing a p2m lookup because the p2m has not been torn down yet, since we haven't gotten to the RCU callback. Eventually, the mapping fails in page_get_owner in get_page_from_l1e.

The mapping attempt fails, as expected, but what makes me uneasy is that there is still an active p2m lurking around, with seemingly valid translations to valid mfns, while all the domain pages are gone.

Is this a race condition? Can this lead to trouble?

Thanks!
Andres
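A minimal sketch of the window being described, under stated assumptions: the function sketch_map_foreign_gfn is invented for illustration, and while get_gfn/put_gfn/get_page/mfn_to_page are real Xen interfaces, their exact signatures vary across trees, so treat this as pseudocode in C clothing rather than the actual mm.c path.

    /* Illustrative sketch only; not the actual mm.c code. */
    #include <xen/sched.h>
    #include <asm/p2m.h>
    #include <asm/mm.h>

    static int sketch_map_foreign_gfn(struct domain *pg_owner,
                                      unsigned long gfn)
    {
        p2m_type_t t;
        /* The p2m is still live even though relinquish_resources() has
         * already disowned and put the domain's pages, so this lookup
         * can still return a seemingly valid mfn. */
        unsigned long mfn = mfn_x(get_gfn(pg_owner, gfn, &t));
        struct page_info *page = mfn_valid(mfn) ? mfn_to_page(mfn) : NULL;

        /* The page no longer belongs to pg_owner, so taking a reference
         * against it fails; this is where the "Error getting mfn ..."
         * log message comes from. */
        if ( page == NULL || !get_page(page, pg_owner) )
        {
            put_gfn(pg_owner, gfn);
            return -EINVAL;
        }

        /* ... install the l1e, then drop the references ... */
        put_page(page);
        put_gfn(pg_owner, gfn);
        return 0;
    }

The point of the sketch is that the failure is only caught at the get_page() step; nothing earlier in the lookup tells the caller that the translation is stale.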
At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
> So we've run into this interesting (race?) condition while doing
> stress-testing. We pummel the domain with paging, sharing and mmap
> operations from dom0, and concurrently we launch a domain destruction.
> Often we get in the logs something along these lines:
>
> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from L1
> entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>
> We're using the synchronized p2m patches just posted, so my analysis is
> as follows:
>
> - The domain destroy domctl kicks in. It calls relinquish resources. This
> disowns and puts most domain pages, resulting in invalid (0xff...ff) m2p
> entries.
>
> - In parallel, a do_mmu_update is making progress; it has no issues
> performing a p2m lookup because the p2m has not been torn down yet, since
> we haven't gotten to the RCU callback. Eventually, the mapping fails in
> page_get_owner in get_page_from_l1e.
>
> The mapping attempt fails, as expected, but what makes me uneasy is that
> there is still an active p2m lurking around, with seemingly valid
> translations to valid mfns, while all the domain pages are gone.

Yes. That's OK as long as we know that any user of that page will fail, but I'm not sure that we do.

At one point we talked about get_gfn() taking a refcount on the underlying MFN, which would fix this more cleanly. ISTR the problem was how to make sure the refcount was moved when the gfn->mfn mapping changed.

Can you stick a WARN() in mm.c to get the actual path that leads to the failure?

Tim.
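For context, a rough sketch of the "get_gfn() takes a refcount" idea being discussed; this is hypothetical code that never went into the tree, and the helper name get_gfn_and_ref is invented for illustration. The hard part mentioned above, moving the reference when the gfn->mfn mapping changes underneath a holder, is exactly what this sketch does not solve.

    /* Hypothetical only: a get_gfn() variant that also takes a page
     * reference, so a racing relinquish_resources() cannot free the
     * page while the caller holds the gfn. */
    #include <xen/sched.h>
    #include <asm/p2m.h>
    #include <asm/mm.h>

    static struct page_info *get_gfn_and_ref(struct domain *d,
                                             unsigned long gfn,
                                             p2m_type_t *t)
    {
        unsigned long mfn = mfn_x(get_gfn(d, gfn, t));
        struct page_info *page = NULL;

        if ( mfn_valid(mfn) )
        {
            page = mfn_to_page(mfn);
            /* Fails if the page has already been disowned, e.g. the
             * domain is being torn down; the caller then sees NULL. */
            if ( !get_page(page, d) )
                page = NULL;
        }

        if ( page == NULL )
            put_gfn(d, gfn);

        /* Unsolved: if the gfn->mfn mapping changes while the caller
         * holds this reference (paging out, sharing, ...), the
         * reference would have to migrate to the new mfn. */
        return page;
    }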
Andres Lagar-Cavilla
2012-Feb-10 18:05 UTC
Re: Domain relinquish resources racing with p2m access
> At 12:49 -0800 on 01 Feb (1328100564), Andres Lagar-Cavilla wrote:
>> So we've run into this interesting (race?) condition while doing
>> stress-testing. We pummel the domain with paging, sharing and mmap
>> operations from dom0, and concurrently we launch a domain destruction.
>> Often we get in the logs something along these lines:
>>
>> (XEN) mm.c:958:d0 Error getting mfn 859b1a (pfn ffffffffffffffff) from
>> L1 entry 8000000859b1a625 for l1e_owner=0, pg_owner=1
>>
>> We're using the synchronized p2m patches just posted, so my analysis is
>> as follows:
>>
>> - The domain destroy domctl kicks in. It calls relinquish resources.
>> This disowns and puts most domain pages, resulting in invalid
>> (0xff...ff) m2p entries.
>>
>> - In parallel, a do_mmu_update is making progress; it has no issues
>> performing a p2m lookup because the p2m has not been torn down yet,
>> since we haven't gotten to the RCU callback. Eventually, the mapping
>> fails in page_get_owner in get_page_from_l1e.
>>
>> The mapping attempt fails, as expected, but what makes me uneasy is
>> that there is still an active p2m lurking around, with seemingly valid
>> translations to valid mfns, while all the domain pages are gone.
>
> Yes. That's OK as long as we know that any user of that page will
> fail, but I'm not sure that we do.
>
> At one point we talked about get_gfn() taking a refcount on the
> underlying MFN, which would fix this more cleanly. ISTR the problem was
> how to make sure the refcount was moved when the gfn->mfn mapping
> changed.

Oh, I ditched that because it's too hairy and error-prone. There are plenty of nested get_gfn's with the n>1 call changing the mfn. So unless we make a point of remembering the mfn at the point of get_gfn, it's just impossible to make this work. And then "remembering the mfn" means a serious uglification of existing code.

> Can you stick a WARN() in mm.c to get the actual path that leads to the
> failure?

As a debug aid, or as actual code to make it into the tree? This typically happens in batches of a few dozen, so a WARN is going to massively spam the console with stack traces. Guess how I found out ...

The moral is that the code is reasonably defensive, so this gets caught, albeit in a rather verbose way. But this might eventually bite someone who does a get_gfn and doesn't either check that the domain is dying or ensure that a get_page succeeds.

Andres
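A sketch of the caller-side discipline being described, again with approximate signatures (use_gfn_safely is an invented name, not a patch against the tree): a stale but valid-looking translation is only tolerable if the caller either takes a page reference or treats a dying domain as the expected, quiet failure case.

    /* Sketch of the defensive get_gfn caller pattern; illustrative only. */
    #include <xen/sched.h>
    #include <xen/lib.h>
    #include <asm/p2m.h>
    #include <asm/mm.h>

    static int use_gfn_safely(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        unsigned long mfn = mfn_x(get_gfn(d, gfn, &t));
        struct page_info *page = mfn_valid(mfn) ? mfn_to_page(mfn) : NULL;

        if ( page == NULL || !get_page(page, d) )
        {
            /* A dying domain is the expected case; only complain (and
             * avoid spamming the console) when the domain is supposed
             * to be alive and the translation is nonetheless stale. */
            if ( !d->is_dying )
                gdprintk(XENLOG_WARNING,
                         "stale p2m entry for d%d gfn %lx\n",
                         d->domain_id, gfn);
            put_gfn(d, gfn);
            return -EINVAL;
        }

        /* Holding the page reference means relinquish_resources()
         * cannot free the page underneath us while we use it. */
        /* ... do the work ... */

        put_page(page);
        put_gfn(d, gfn);
        return 0;
    }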