I've observed this with both a Solaris and a FC6 domU (up to date as of last
week or so) in 64-bit. If you place the domU under reasonable networking
stress (such as a 'find /nfs/path >/dev/null'), live migration usually, but
not always, fails:

bash-3.00# while xm migrate --live fedora64 localhost ; do echo done ; done

(XEN) memory.c:188:d2 Dom2 freeing in-use page 9f40f (pseudophys 1d007): count=2 type=e8000000
(XEN) memory.c:188:d2 Dom2 freeing in-use page 9f409 (pseudophys 1d00b): count=2 type=e8000000
(XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error pfn 9f738: rd=ffff830000fe0100, od=ffff830000000002, caf=00000000, taf=0000000000000002
(XEN) mm.c:590:d0 Error getting mfn 9f738 (pfn 12026) from L1 entry 000000009f738705 for dom2
Error: /usr/lib/xen/bin/xc_save 27 2 0 0 1 failed

Some experimentation has revealed that this only happens if a vif is
configured and used, which seems like it's related to giving away pages (as
rd != od would indicate too...). Anybody else seeing this? I've only tested
on a Solaris dom0 so far, though I can't think of anything that would affect
this.

thanks
john
Ian Pratt
2007-Feb-20 22:38 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> I've observed this with both a Solaris and a FC6 domU (up to date as of
> bash-3.00# while xm migrate --live fedora64 localhost ; do echo done ; done
> (XEN) memory.c:188:d2 Dom2 freeing in-use page 9f40f (pseudophys 1d007):
> count=2 type=e8000000
> (XEN) memory.c:188:d2 Dom2 freeing in-use page 9f409 (pseudophys 1d00b):
> count=2 type=e8000000
> (XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error
> pfn 9f738: rd=ffff830000fe0100, od=ffff830000000002, caf=00000000,
> taf=0000000000000002
> (XEN) mm.c:590:d0 Error getting mfn 9f738 (pfn 12026) from L1 entry
> 000000009f738705 for dom2
> Error: /usr/lib/xen/bin/xc_save 27 2 0 0 1 failed
>
> Some experimentation has revealed that this only happens if a vif is
> configured and used, which seems like it's related to giving away pages
> (as rd != od would indicate too...). Anybody else seeing this? I've only
> tested on a Solaris dom0 so far, though I can't think of anything that
> would affect this.

These guests are using rx-flip rather than rx-copy, right?
This has certainly worked reliably in the past (e.g. 3.0.3), but is now
getting little testing as current guests use rx-copy by default.

The freeing in-use page messages may be unrelated to the actual problem --
AFAIK that's a relatively new printk that could occur benignly during a
live migrate of an rx-flip guest.

Even get_page can fail benignly under certain circumstances during a live
migrate. It's worth finding out where the actual error in xc_linux_save is.

Ian
Keir Fraser
2007-Feb-20 22:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 21:50, "John Levon" <levon@movementarian.org> wrote:

> Some experimentation has revealed that this only happens if a vif is
> configured and used, which seems like it's related to giving away pages
> (as rd != od would indicate too...). Anybody else seeing this? I've only
> tested on a Solaris dom0 so far, though I can't think of anything that
> would affect this.

Do you know if it's the case that both netfronts are using the 'old-style'
page-flipping receive mechanism, rather than the copying mechanism?

 -- Keir
Keir Fraser
2007-Feb-20 23:00 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 22:38, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

> These guests are using rx-flip rather than rx-copy, right?
> This has certainly worked reliably in the past (e.g. 3.0.3), but is now
> getting little testing as current guests use rx-copy by default.
>
> The freeing in-use page messages may be unrelated to the actual problem
> -- AFAIK that's a relatively new printk that could occur benignly during
> a live migrate of an rx-flip guest.
>
> Even get_page can fail benignly under certain circumstances during a
> live migrate. It's worth finding out where the actual error in
> xc_linux_save is.

Also, apart from the fact that save/restore *should* still work with
old-style netfront, there are other legitimate reasons for pages to
disappear from a guest's allocation (e.g., running the balloon driver
during save/restore). It might be worth seeing if you can provoke similar
xc_save errors under balloon load.

 -- Keir
John Levon
2007-Feb-20 23:04 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Tue, Feb 20, 2007 at 10:38:47PM -0000, Ian Pratt wrote:

> These guests are using rx-flip rather than rx-copy, right?

Solaris is, and presumably FC6 too.

> This has certainly worked reliably in the past (e.g. 3.0.3), but is now

I forgot to mention that this is 3.0.4-based.

> The freeing in-use page messages may be unrelated to the actual problem
> -- AFAIK that's a relatively new printk that could occur benignly during
> a live migrate of an rx-flip guest.

We're failing here:

[2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Saving memory pages: iter 2 0%ERROR Internal error: Error when writing to state file (5) (errno 14)
[2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Save exit rc=1
[2007-02-20 13:39:50 xend 100401] ERROR (XendCheckpoint:111) Save failed on domain fedora64 (2).

1049            /* We have a normal page: just write it directly. */
1050            if (ratewrite(io_fd, spage, PAGE_SIZE) != PAGE_SIZE) {
1051                ERROR("Error when writing to state file (5)"

IOW, we're faulting (EFAULT) on the domain's MFN, hence the above error.

regards
john
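[For context on the errno: 14 is EFAULT, which is what write(2) returns when
the source buffer isn't actually readable, i.e. when the foreign-page mapping
behind spage was never really established. A minimal, standalone demonstration
of that failure mode -- plain POSIX, not Xen code; PROT_NONE merely stands in
for an unpopulated foreign mapping:]

/*
 * Sketch only: write(2) from an inaccessible source buffer fails with
 * EFAULT (errno 14) rather than faulting the process -- the same errno
 * that ratewrite()/xc_linux_save reports above.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Stand-in for a foreign-page mapping the kernel never populated. */
    void *spage = mmap(NULL, page, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (spage == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (write(STDOUT_FILENO, spage, page) != (ssize_t)page)
        fprintf(stderr, "write failed: %s (errno %d)\n",
                strerror(errno), errno);   /* expect EFAULT, errno 14 */

    munmap(spage, page);
    return 0;
}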
David Edmondson
2007-Feb-20 23:12 UTC
[Xen-devel] Re: Live migration fails under heavy network use
* levon@movementarian.org [2007-02-20 23:04:47]

> On Tue, Feb 20, 2007 at 10:38:47PM -0000, Ian Pratt wrote:
>> These guests are using rx-flip rather than rx-copy, right?
>
> Solaris is, and presumably FC6 too.

Solaris dom0 supports only rx-flip at this point, so all guests running
alongside a Solaris dom0 are forced to use it.

dme.
-- 
David Edmondson, http://www.dme.org
Ian Pratt
2007-Feb-20 23:48 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> > The freeing in-use page messages may be unrelated to the actual problem
> > -- AFAIK that's a relatively new printk that could occur benignly during
> > a live migrate of an rx-flip guest.
>
> We're failing here:
>
> [2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Saving memory
> pages: iter 2 0%ERROR Internal error: Error when writing to state file
> (5) (errno 14)
> [2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Save exit rc=1
> [2007-02-20 13:39:50 xend 100401] ERROR (XendCheckpoint:111) Save failed on
> domain fedora64 (2).
>
> 1049            /* We have a normal page: just write it directly. */
> 1050            if (ratewrite(io_fd, spage, PAGE_SIZE) != PAGE_SIZE) {
> 1051                ERROR("Error when writing to state file (5)"
>
> IOW, we're faulting (EFAULT) on the domain's MFN, hence the above error.

Urk. Check out line 204 of privcmd.c -- that doesn't look too 64b clean to
me...

The top nibble is supposed to be set if it's not possible to map the frame
correctly. This will be propagated through the xc_get_pfn_type_batch call,
hence skipping the frame.

Ian
Keir Fraser
2007-Feb-21 00:06 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 23:48, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

> Urk. Check out line 204 of privcmd.c
> That doesn't look too 64b clean to me....

Should be good for 1TB of memory. Also the subsequent get_pfn_type_batch
hypercall takes an array of 32-bit values, so the type/error nibble has to
be at the top of 32 bits rather than 64 bits.

> The top nibble is supposed to be set if it's not possible to map the
> frame correctly. This will be propagated through the
> xc_get_pfn_type_batch call, hence skipping the frame.

That's the theory. :-)

 -- Keir
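[What is being described here is a simple convention: the batch-mapping path
records, in the top nibble of each 32-bit entry of the caller's frame array,
whether that frame could actually be mapped, and the save loop then skips
flagged frames instead of faulting on them later. A rough illustrative sketch
of that convention -- the names and the mapping stub are hypothetical, not the
real privcmd/libxc code:]

#include <stdint.h>
#include <stdio.h>

#define ERROR_NIBBLE_SHIFT  28
#define ERROR_NIBBLE_MASK   (0xFU << ERROR_NIBBLE_SHIFT)  /* top 4 bits of 32 */

/* Hypothetical mapper: returns 0 on success, nonzero if the MFN can't be mapped. */
static int map_one_frame(uint32_t mfn) { return (mfn == 0x9f738) ? -1 : 0; }

/* Kernel side (sketch): attempt every mapping now, flag failures in place. */
static void mark_unmappable(uint32_t *frames, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (map_one_frame(frames[i]) != 0)
            frames[i] |= ERROR_NIBBLE_MASK;
}

/* Saver side (sketch): skip anything flagged rather than touching it. */
static void save_frames(const uint32_t *frames, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (frames[i] & ERROR_NIBBLE_MASK) {
            printf("skipping unmappable frame entry %zu\n", i);
            continue;
        }
        printf("would write frame %#x\n", frames[i]);
    }
}

int main(void)
{
    uint32_t frames[] = { 0x9f738, 0x9f409 };
    mark_unmappable(frames, 2);
    save_frames(frames, 2);
    return 0;
}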
John Levon
2007-Feb-21 00:41 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Wed, Feb 21, 2007 at 12:06:52AM +0000, Keir Fraser wrote:

>>> Urk. Check out line 204 of privcmd.c
>>> That doesn't look too 64b clean to me....

Hmm, I never looked at the Linux privcmd driver before. It seems like
you're prefaulting all the pages in at ioctl() time. Currently on Solaris
we're demand-faulting each page in the mmap()...

It looks like we're not checking that we can actually map the page at the
time of the ioctl(), thus it's not ending up marked with that bit.

Looks like our bug...

thanks,
john
Keir Fraser
2007-Feb-21 07:32 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 21/2/07 00:41, "John Levon" <levon@movementarian.org> wrote:

> On Wed, Feb 21, 2007 at 12:06:52AM +0000, Keir Fraser wrote:
>
>>> Urk. Check out line 204 of privcmd.c
>>> That doesn't look too 64b clean to me....
>
> Hmm, I never looked at the Linux privcmd driver before. It seems like
> you're prefaulting all the pages in at ioctl() time. Currently on Solaris
> we're demand-faulting each page in the mmap()...
>
> It looks like we're not checking that we can actually map the page at the
> time of the ioctl(), thus it's not ending up marked with that bit.
>
> Looks like our bug...

If you check then you may as well map at the same time. I think it would
actually be hard to defer the mapping in a race-free manner.

 -- Keir
John Levon
2007-Feb-22 19:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Wed, Feb 21, 2007 at 07:32:10AM +0000, Keir Fraser wrote:

>>> Hmm, I never looked at the Linux privcmd driver before. It seems like
>>> you're prefaulting all the pages in at ioctl() time. Currently on Solaris
>>> we're demand-faulting each page in the mmap()...
>>>
>>> It looks like we're not checking that we can actually map the page at the
>>> time of the ioctl(), thus it's not ending up marked with that bit.
>>>
>>> Looks like our bug...
>
> If you check then you may as well map at the same time. I think it would
> actually be hard to defer the mapping in a race-free manner.

I've modified the segment driver to prefault the MFNs, and things seem a
lot better for both Solaris and Linux domUs:

(XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error pfn 5512: rd=ffff830000f92100, od=0000000000000000, caf=00000000, taf=0000000000000002
(XEN) mm.c:590:d0 Error getting mfn 5512 (pfn 47fa) from L1 entry 0000000005512705 for dom52
(XEN) mm.c:566:d0 Non-privileged (53) attempt to map I/O space 00000000
done
done

Not quite sure why the new domain is trying to map 00000000, though.

I also see a fair amount of:

Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000

Ian mentioned that these could be harmless. Is this because dom0 has mapped
the page for suspending? And if so, why is the count two, not one?

Should xc_core.c really be using MMAPBATCH? I suppose it's convenient, but
it does mean that the frame lists have to be locked in memory.

regards,
john
Keir Fraser
2007-Feb-22 20:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 22/2/07 19:55, "John Levon" <levon@movementarian.org> wrote:

> I also see a fair amount of:
>
> Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
>
> Ian mentioned that these could be harmless. Is this because dom0 has
> mapped the page for suspending? And if so, why is the count two, not one?

Yes, it's because of the extra dom0 mapping. The count is 2 because one
reference is held at this point by the freeing function in Xen.

The comment above that warning in the Xen source is a bit strong. I don't
think we can change the semantics of the existing command, but we may add a
new one to let the caller know whether the page(s) could immediately be
freed.

 -- Keir
John Levon
2007-Feb-22 21:11 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Thu, Feb 22, 2007 at 08:55:39PM +0000, Keir Fraser wrote:

> Yes, it's because of the extra dom0 mapping. The count is 2 because one
> reference is held at this point by the freeing function in Xen.
>
> The comment above that warning in the Xen source is a bit strong. I don't
> think we can change the semantics of the existing command, but we may add a
> new one to let the caller know whether the page(s) could immediately be
> freed.

Hmm. This debug message has caught one real bug in our balloon code
recently. I presume it's not trivial to distinguish between that case
(where the domain really did have its own mapping) and the live-migration
one, though.

regards
john
Ian Pratt
2007-Feb-22 22:34 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> I've modified the segment driver to prefault the MFNs, and things seem a
> lot better for both Solaris and Linux domUs:
>
> (XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error
> pfn 5512: rd=ffff830000f92100, od=0000000000000000, caf=00000000,
> taf=0000000000000002
> (XEN) mm.c:590:d0 Error getting mfn 5512 (pfn 47fa) from L1 entry
> 0000000005512705 for dom52
> (XEN) mm.c:566:d0 Non-privileged (53) attempt to map I/O space 00000000
> done
> done
>
> Not quite sure why the new domain is trying to map 00000000, though.

The messages from the save side are expected. Is the message from the
restored domain triggered by the restore code, i.e. before the domain is
un-paused?

I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
'deadb000' you'll see that propagated through to the restore message. In
which case, it's ugly, but benign.

> I also see a fair amount of:
>
> Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000

That's fine. Debug builds are a bit chatty for live migration...

Ian
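[The canonicalisation referred to here is the pass in the save code that
rewrites saved page-table pages so their entries hold PFNs rather than MFNs;
entries whose MFN the saver can no longer translate (e.g. a page that was
flipped away) get a placeholder PFN, which is why a recognisable sentinel such
as 'deadb000' would then show up in the restore-side message. A rough,
self-contained sketch of that idea -- the toy m2p table, names, and sentinel
are hypothetical, not the actual canonicalize_pagetable() from xc_linux_save.c:]

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT     12
#define PTE_PRESENT    0x1ul
#define PTE_FLAGS_MASK 0xffful         /* low flag bits only, for illustration */
#define SENTINEL_PFN   0xdeadbul       /* recognisable marker, in the spirit of 'deadb000' */

#define M2P_SIZE 16
static const uint64_t m2p[M2P_SIZE] = {  /* toy machine-to-physical table */
    [0x3] = 0x7, [0x5] = 0x2, [0xa] = 0x1,
};

static uint64_t mfn_to_pfn(uint64_t mfn)
{
    /* Unknown MFN: the saver can't translate it, so hand back a sentinel. */
    return (mfn < M2P_SIZE && m2p[mfn]) ? m2p[mfn] : SENTINEL_PFN;
}

/* Rewrite one page-table page in place so its entries reference PFNs, not MFNs. */
static void canonicalise(uint64_t *pt, unsigned int nr)
{
    for (unsigned int i = 0; i < nr; i++) {
        if (!(pt[i] & PTE_PRESENT))
            continue;                  /* non-present entries stay as-is */
        uint64_t mfn = pt[i] >> PAGE_SHIFT;
        pt[i] = (mfn_to_pfn(mfn) << PAGE_SHIFT) | (pt[i] & PTE_FLAGS_MASK);
    }
}

int main(void)
{
    uint64_t pt[] = { (0x3ul << PAGE_SHIFT) | 0x705,   /* translatable */
                      (0x9ul << PAGE_SHIFT) | 0x705,   /* unknown MFN: gets sentinel */
                      0x0 };                           /* not present */
    canonicalise(pt, 3);
    for (int i = 0; i < 3; i++)
        printf("pte[%d] = %#lx\n", i, (unsigned long)pt[i]);
    return 0;
}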
John Levon
2007-Feb-23 00:22 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Thu, Feb 22, 2007 at 10:34:30PM -0000, Ian Pratt wrote:

> > Not quite sure why the new domain is trying to map 00000000, though.
>
> The messages from the save side are expected. Is the message from the
> restored domain triggered by the restore code, i.e. before the domain is
> un-paused?

I suspect so, but haven't proved that.

> I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
> 'deadb000' you'll see that propagated through to the restore message. In
> which case, it's ugly, but benign.

Wouldn't that pfn of 0 be an MFN other than 0, though? I do not see any
change when setting pfn as above. Any further ideas? I can try adding some
back traces. I suppose you're not seeing it with a Linux dom0?

> > I also see a fair amount of:
> >
> > Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
>
> That's fine. Debug builds are a bit chatty for live migration...

Both of these:

(XEN) mm.c:590:d0 Error getting mfn a005e (pfn 4c35) from L1 entry 00000000a005e705 for dom2
(XEN) mm.c:566:d0 Non-privileged (3) attempt to map I/O space 00000000

are also present in a non-debug build. Would you take a patch to make both
of them XENLOG_INFO? It's not good that we get console noise during normal
operation (presuming the I/O space one /is/ normal operation!).

regards
john
Ian Pratt
2007-Feb-23 01:00 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> On Thu, Feb 22, 2007 at 10:34:30PM -0000, Ian Pratt wrote:
>
> > > Not quite sure why the new domain is trying to map 00000000, though.
> >
> > The messages from the save side are expected. Is the message from the
> > restored domain triggered by the restore code, i.e. before the domain is
> > un-paused?
>
> I suspect so, but haven't proved that.

That would be a good test.

> > I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
> > 'deadb000' you'll see that propagated through to the restore message. In
> > which case, it's ugly, but benign.
>
> Wouldn't that pfn of 0 be an MFN other than 0, though?

Fair point. Thinking about it, that should be patched up by the next
iteration anyhow.

> I do not see any change when setting pfn as above. Any further ideas?
> I can try adding some back traces. I suppose you're not seeing it with
> a Linux dom0?

I don't think so, but I couldn't swear to it. It did used to come out
during Linux domain boot at one point; I can't remember whether it still
does.

I presume the domain itself seems to suffer no ill effects?

> > > I also see a fair amount of:
> > >
> > > Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
> >
> > That's fine. Debug builds are a bit chatty for live migration...
>
> Both of these:
>
> (XEN) mm.c:590:d0 Error getting mfn a005e (pfn 4c35) from L1 entry
> 00000000a005e705 for dom2
> (XEN) mm.c:566:d0 Non-privileged (3) attempt to map I/O space 00000000
>
> are also present in a non-debug build. Would you take a patch to make both
> of them XENLOG_INFO? It's not good that we get console noise during normal
> operation (presuming the I/O space one /is/ normal operation!).

I think we need to understand this one first.

Thanks,
Ian
John Levon
2007-Feb-23 01:09 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Fri, Feb 23, 2007 at 01:00:34AM -0000, Ian Pratt wrote:

> I don't think so, but I couldn't swear to it. It did used to come out
> during Linux domain boot at one point; I can't remember whether it still
> does.
>
> I presume the domain itself seems to suffer no ill effects?

Right, it seems fine. I'll add some more debug to try and figure out where
it's coming from soon.

thanks
john