I've observed this with both a Solaris and a FC6 domU (up to date as of last
week or so) in 64-bit. If you place the domU under reasonable networking
stress (such as a 'find /nfs/path >/dev/null'), live migration usually, but
not always, fails:

bash-3.00# while xm migrate --live fedora64 localhost ; do echo done ; done

(XEN) memory.c:188:d2 Dom2 freeing in-use page 9f40f (pseudophys 1d007): count=2 type=e8000000
(XEN) memory.c:188:d2 Dom2 freeing in-use page 9f409 (pseudophys 1d00b): count=2 type=e8000000
(XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error pfn 9f738: rd=ffff830000fe0100, od=ffff830000000002, caf=00000000, taf=0000000000000002
(XEN) mm.c:590:d0 Error getting mfn 9f738 (pfn 12026) from L1 entry 000000009f738705 for dom2
Error: /usr/lib/xen/bin/xc_save 27 2 0 0 1 failed

Some experimentation has revealed that this only happens if a vif is
configured and used, which seems like it's related to giving away pages (as
rd != od would indicate too...). Anybody else seeing this? I've only tested
on a Solaris dom0 so far, though I can't think of anything that would affect
this.

thanks
john
Ian Pratt
2007-Feb-20 22:38 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> I've observed this with both a Solaris and a FC6 domU (up to date as of
> bash-3.00# while xm migrate --live fedora64 localhost ; do echo done ; done
> (XEN) memory.c:188:d2 Dom2 freeing in-use page 9f40f (pseudophys 1d007):
> count=2 type=e8000000
> (XEN) memory.c:188:d2 Dom2 freeing in-use page 9f409 (pseudophys 1d00b):
> count=2 type=e8000000
> (XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error
> pfn 9f738: rd=ffff830000fe0100, od=ffff830000000002, caf=00000000,
> taf=0000000000000002
> (XEN) mm.c:590:d0 Error getting mfn 9f738 (pfn 12026) from L1 entry
> 000000009f738705 for dom2
> Error: /usr/lib/xen/bin/xc_save 27 2 0 0 1 failed
>
> Some experimentation has revealed that this only happens if a vif is
> configured and used, which seems like it's related to giving away pages
> (as rd != od would indicate too...). Anybody else seeing this? I've only
> tested on a Solaris dom0 so far, though I can't think of anything that
> would affect this.

These guests are using rx-flip rather than rx-copy, right?
This has certainly worked reliably in the past (e.g. 3.0.3), but is now
getting little testing as current guests use rx-copy by default.

The freeing in-use page messages may be unrelated to the actual problem --
AFAIK that's a relatively new printk that could occur benignly during a
live migrate of an rx-flip guest.

Even get_page can fail benignly under certain circumstances during a live
migrate. It's worth finding out where the actual error in xc_linux_save is.

Ian
Keir Fraser
2007-Feb-20 22:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 21:50, "John Levon" <levon@movementarian.org> wrote:

> Some experimentation has revealed that this only happens if a vif is
> configured and used, which seems like it's related to giving away pages
> (as rd != od would indicate too...). Anybody else seeing this? I've only
> tested on a Solaris dom0 so far, though I can't think of anything that
> would affect this.

Do you know if it's the case that both netfronts are using the 'old-style'
page-flipping receive mechanism, rather than the copying mechanism?

 -- Keir
Keir Fraser
2007-Feb-20 23:00 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 22:38, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

> These guests are using rx-flip rather than rx-copy, right?
> This has certainly worked reliably in the past (e.g. 3.0.3), but is now
> getting little testing as current guests use rx-copy by default.
>
> The freeing in-use page messages may be unrelated to the actual problem
> -- AFAIK that's a relatively new printk that could occur benignly during
> a live migrate of an rx-flip guest.
>
> Even get_page can fail benignly under certain circumstances during a
> live migrate. It's worth finding out where the actual error in
> xc_linux_save is.

Also, apart from the fact that save/restore *should* still work with
old-style netfront, there are other legitimate reasons for pages to
disappear from a guest's allocation (e.g., running the balloon driver
during save/restore). It might be worth seeing if you can provoke similar
xc_save errors under balloon load.

 -- Keir
John Levon
2007-Feb-20 23:04 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Tue, Feb 20, 2007 at 10:38:47PM -0000, Ian Pratt wrote:

> These guests are using rx-flip rather than rx-copy, right?

Solaris is, and presumably FC6 too.

> This has certainly worked reliably in the past (e.g. 3.0.3), but is now

I forgot to mention that this is 3.0.4-based.

> The freeing in-use page messages may be unrelated to the actual problem
> -- AFAIK that's a relatively new printk that could occur benignly during
> a live migrate of an rx-flip guest.

We're failing here:

[2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Saving memory pages: iter 2 0%ERROR Internal error: Error when writing to state file (5) (errno 14)
[2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Save exit rc=1
[2007-02-20 13:39:50 xend 100401] ERROR (XendCheckpoint:111) Save failed on domain fedora64 (2).

1049            /* We have a normal page: just write it directly. */
1050            if (ratewrite(io_fd, spage, PAGE_SIZE) != PAGE_SIZE) {
1051                ERROR("Error when writing to state file (5)"

IOW, we're faulting (EFAULT) on the domain's MFN, hence the above error.

regards
john
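[For context on the errno: 14 is EFAULT, which is what write(2) returns when
the source buffer isn't actually readable, i.e. when the foreign-page mapping
behind spage was never really established. A minimal, standalone demonstration
of that failure mode -- plain POSIX, not Xen code; PROT_NONE merely stands in
for an unpopulated foreign mapping:]

/*
 * Sketch only: write(2) from an inaccessible source buffer fails with
 * EFAULT (errno 14) rather than faulting the process -- the same errno
 * that ratewrite()/xc_linux_save reports above.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Stand-in for a foreign-page mapping the kernel never populated. */
    void *spage = mmap(NULL, page, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (spage == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (write(STDOUT_FILENO, spage, page) != (ssize_t)page)
        fprintf(stderr, "write failed: %s (errno %d)\n",
                strerror(errno), errno);   /* expect EFAULT, errno 14 */

    munmap(spage, page);
    return 0;
}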
David Edmondson
2007-Feb-20 23:12 UTC
[Xen-devel] Re: Live migration fails under heavy network use
* levon@movementarian.org [2007-02-20 23:04:47]

> On Tue, Feb 20, 2007 at 10:38:47PM -0000, Ian Pratt wrote:
>> These guests are using rx-flip rather than rx-copy, right?
>
> Solaris is, and presumably FC6 too.

Solaris dom0 supports only rx-flip at this point, so all guests running
alongside a Solaris dom0 are forced to use it.

dme.
-- 
David Edmondson, http://www.dme.org
Ian Pratt
2007-Feb-20 23:48 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> > The freeing in-use page messages may be unrelated to the actual problem
> > -- AFAIK that's a relatively new printk that could occur benignly during
> > a live migrate of an rx-flip guest.
>
> We're failing here:
>
> [2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Saving memory
> pages: iter 2 0%ERROR Internal error: Error when writing to state file
> (5) (errno 14)
> [2007-02-20 13:39:50 xend 100401] INFO (XendCheckpoint:247) Save exit rc=1
> [2007-02-20 13:39:50 xend 100401] ERROR (XendCheckpoint:111) Save failed on
> domain fedora64 (2).
>
> 1049            /* We have a normal page: just write it directly. */
> 1050            if (ratewrite(io_fd, spage, PAGE_SIZE) != PAGE_SIZE) {
> 1051                ERROR("Error when writing to state file (5)"
>
> IOW, we're faulting (EFAULT) on the domain's MFN, hence the above error.

Urk. Check out line 204 of privcmd.c -- that doesn't look too 64b clean to
me...

The top nibble is supposed to be set if it's not possible to map the frame
correctly. This will be propagated through the xc_get_pfn_type_batch call,
hence skipping the frame.

Ian
Keir Fraser
2007-Feb-21 00:06 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 20/2/07 23:48, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

> Urk. Check out line 204 of privcmd.c
> That doesn't look too 64b clean to me....

Should be good for 1TB of memory. Also the subsequent get_pfn_type_batch
hypercall takes an array of 32-bit values, so the type/error nibble has to
be at the top of 32 bits rather than 64 bits.

> The top nibble is supposed to be set if it's not possible to map the
> frame correctly. This will be propagated through the
> xc_get_pfn_type_batch call, hence skipping the frame.

That's the theory. :-)

 -- Keir
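[What is being described here is a simple convention: the batch-mapping path
records, in the top nibble of each 32-bit entry of the caller's frame array,
whether that frame could actually be mapped, and the save loop then skips
flagged frames instead of faulting on them later. A rough illustrative sketch
of that convention -- the names and the mapping stub are hypothetical, not the
real privcmd/libxc code:]

#include <stdint.h>
#include <stdio.h>

#define ERROR_NIBBLE_SHIFT  28
#define ERROR_NIBBLE_MASK   (0xFU << ERROR_NIBBLE_SHIFT)  /* top 4 bits of 32 */

/* Hypothetical mapper: returns 0 on success, nonzero if the MFN can't be mapped. */
static int map_one_frame(uint32_t mfn) { return (mfn == 0x9f738) ? -1 : 0; }

/* Kernel side (sketch): attempt every mapping now, flag failures in place. */
static void mark_unmappable(uint32_t *frames, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (map_one_frame(frames[i]) != 0)
            frames[i] |= ERROR_NIBBLE_MASK;
}

/* Saver side (sketch): skip anything flagged rather than touching it. */
static void save_frames(const uint32_t *frames, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (frames[i] & ERROR_NIBBLE_MASK) {
            printf("skipping unmappable frame entry %zu\n", i);
            continue;
        }
        printf("would write frame %#x\n", frames[i]);
    }
}

int main(void)
{
    uint32_t frames[] = { 0x9f738, 0x9f409 };
    mark_unmappable(frames, 2);
    save_frames(frames, 2);
    return 0;
}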
John Levon
2007-Feb-21 00:41 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Wed, Feb 21, 2007 at 12:06:52AM +0000, Keir Fraser wrote:

>>> Urk. Check out line 204 of privcmd.c
>>> That doesn't look too 64b clean to me....

Hmm, I never looked at the Linux privcmd driver before. It seems like
you're prefaulting all the pages in at ioctl() time. Currently on Solaris
we're demand-faulting each page in the mmap()...

It looks like we're not checking that we can actually map the page at the
time of the ioctl(), thus it's not ending up marked with that bit.

Looks like our bug...

thanks,
john
Keir Fraser
2007-Feb-21 07:32 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 21/2/07 00:41, "John Levon" <levon@movementarian.org> wrote:

> On Wed, Feb 21, 2007 at 12:06:52AM +0000, Keir Fraser wrote:
>
>>> Urk. Check out line 204 of privcmd.c
>>> That doesn't look too 64b clean to me....
>
> Hmm, I never looked at the Linux privcmd driver before. It seems like
> you're prefaulting all the pages in at ioctl() time. Currently on Solaris
> we're demand-faulting each page in the mmap()...
>
> It looks like we're not checking that we can actually map the page at the
> time of the ioctl(), thus it's not ending up marked with that bit.
>
> Looks like our bug...

If you check then you may as well map at the same time. I think it would
actually be hard to defer the mapping in a race-free manner.

 -- Keir
John Levon
2007-Feb-22 19:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Wed, Feb 21, 2007 at 07:32:10AM +0000, Keir Fraser wrote:

>>> Hmm, I never looked at the Linux privcmd driver before. It seems like
>>> you're prefaulting all the pages in at ioctl() time. Currently on Solaris
>>> we're demand-faulting each page in the mmap()...
>>>
>>> It looks like we're not checking that we can actually map the page at the
>>> time of the ioctl(), thus it's not ending up marked with that bit.
>>>
>>> Looks like our bug...
>
> If you check then you may as well map at the same time. I think it would
> actually be hard to defer the mapping in a race-free manner.

I've modified the segment driver to prefault the MFNs, and things seem a
lot better for both Solaris and Linux domUs:

(XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error pfn 5512: rd=ffff830000f92100, od=0000000000000000, caf=00000000, taf=0000000000000002
(XEN) mm.c:590:d0 Error getting mfn 5512 (pfn 47fa) from L1 entry 0000000005512705 for dom52
(XEN) mm.c:566:d0 Non-privileged (53) attempt to map I/O space 00000000
done
done

Not quite sure why the new domain is trying to map 00000000, though.

I also see a fair amount of:

Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000

Ian mentioned that these could be harmless. Is this because dom0 has mapped
the page for suspending? And if so, why is the count two, not one?

Should xc_core.c really be using MMAPBATCH? I suppose it's convenient, but
it does mean that the frame lists have to be locked in memory.

regards,
john
Keir Fraser
2007-Feb-22 20:55 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On 22/2/07 19:55, "John Levon" <levon@movementarian.org> wrote:

> I also see a fair amount of:
>
> Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
>
> Ian mentioned that these could be harmless. Is this because dom0 has
> mapped the page for suspending? And if so, why is the count two, not one?

Yes, it's because of the extra dom0 mapping. The count is 2 because one
reference is held at this point by the freeing function in Xen.

The comment above that warning in the Xen source is a bit strong. I don't
think we can change the semantics of the existing command, but we may add a
new one to let the caller know whether the page(s) could immediately be
freed.

 -- Keir
John Levon
2007-Feb-22 21:11 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Thu, Feb 22, 2007 at 08:55:39PM +0000, Keir Fraser wrote:

> Yes, it's because of the extra dom0 mapping. The count is 2 because one
> reference is held at this point by the freeing function in Xen.
>
> The comment above that warning in the Xen source is a bit strong. I don't
> think we can change the semantics of the existing command, but we may add a
> new one to let the caller know whether the page(s) could immediately be
> freed.

Hmm. This debug message has caught one real bug in our balloon code
recently. I presume it's not trivial to distinguish between that case
(where the domain really did have its own mapping) and the live-migration
one, though.

regards
john
Ian Pratt
2007-Feb-22 22:34 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> I've modified the segment driver to prefault the MFNs, and things seem a
> lot better for both Solaris and Linux domUs:
>
> (XEN) /export/johnlev/xen/xen-work/xen.hg/xen/include/asm/mm.h:184:d0 Error
> pfn 5512: rd=ffff830000f92100, od=0000000000000000, caf=00000000,
> taf=0000000000000002
> (XEN) mm.c:590:d0 Error getting mfn 5512 (pfn 47fa) from L1 entry
> 0000000005512705 for dom52
> (XEN) mm.c:566:d0 Non-privileged (53) attempt to map I/O space 00000000
> done
> done
>
> Not quite sure why the new domain is trying to map 00000000, though.

The messages from the save side are expected. Is the message from the
restored domain triggered by the restore code, i.e. before the domain is
un-paused?

I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
'deadb000' you'll see that propagated through to the restore message. In
which case, it's ugly, but benign.

> I also see a fair amount of:
>
> Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000

That's fine. Debug builds are a bit chatty for live migration...

Ian
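[The canonicalisation referred to here is the pass in the save code that
rewrites saved page-table pages so their entries hold PFNs rather than MFNs;
entries whose MFN the saver can no longer translate (e.g. a page that was
flipped away) get a placeholder PFN, which is why a recognisable sentinel such
as 'deadb000' would then show up in the restore-side message. A rough,
self-contained sketch of that idea -- the toy m2p table, names, and sentinel
are hypothetical, not the actual canonicalize_pagetable() from xc_linux_save.c:]

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT     12
#define PTE_PRESENT    0x1ul
#define PTE_FLAGS_MASK 0xffful         /* low flag bits only, for illustration */
#define SENTINEL_PFN   0xdeadbul       /* recognisable marker, in the spirit of 'deadb000' */

#define M2P_SIZE 16
static const uint64_t m2p[M2P_SIZE] = {  /* toy machine-to-physical table */
    [0x3] = 0x7, [0x5] = 0x2, [0xa] = 0x1,
};

static uint64_t mfn_to_pfn(uint64_t mfn)
{
    /* Unknown MFN: the saver can't translate it, so hand back a sentinel. */
    return (mfn < M2P_SIZE && m2p[mfn]) ? m2p[mfn] : SENTINEL_PFN;
}

/* Rewrite one page-table page in place so its entries reference PFNs, not MFNs. */
static void canonicalise(uint64_t *pt, unsigned int nr)
{
    for (unsigned int i = 0; i < nr; i++) {
        if (!(pt[i] & PTE_PRESENT))
            continue;                  /* non-present entries stay as-is */
        uint64_t mfn = pt[i] >> PAGE_SHIFT;
        pt[i] = (mfn_to_pfn(mfn) << PAGE_SHIFT) | (pt[i] & PTE_FLAGS_MASK);
    }
}

int main(void)
{
    uint64_t pt[] = { (0x3ul << PAGE_SHIFT) | 0x705,   /* translatable */
                      (0x9ul << PAGE_SHIFT) | 0x705,   /* unknown MFN: gets sentinel */
                      0x0 };                           /* not present */
    canonicalise(pt, 3);
    for (int i = 0; i < 3; i++)
        printf("pte[%d] = %#lx\n", i, (unsigned long)pt[i]);
    return 0;
}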
John Levon
2007-Feb-23 00:22 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Thu, Feb 22, 2007 at 10:34:30PM -0000, Ian Pratt wrote:

> > Not quite sure why the new domain is trying to map 00000000, though.
>
> The messages from the save side are expected. Is the message from the
> restored domain triggered by the restore code, i.e. before the domain is
> un-paused?

I suspect so, but haven't proved that.

> I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
> 'deadb000' you'll see that propagated through to the restore message. In
> which case, it's ugly, but benign.

Wouldn't that pfn of 0 be an MFN other than 0, though? I do not see any
change when setting pfn as above. Any further ideas? I can try adding some
back traces. I suppose you're not seeing it with a Linux dom0?

> > I also see a fair amount of:
> >
> > Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
>
> That's fine. Debug builds are a bit chatty for live migration...

Both of these:

(XEN) mm.c:590:d0 Error getting mfn a005e (pfn 4c35) from L1 entry 00000000a005e705 for dom2
(XEN) mm.c:566:d0 Non-privileged (3) attempt to map I/O space 00000000

are also present in a non-debug build. Would you take a patch to make both
of them XENLOG_INFO? It's not good that we get console noise during normal
operation (presuming the I/O space one /is/ normal operation!).

regards
john
Ian Pratt
2007-Feb-23 01:00 UTC
RE: [Xen-devel] Live migration fails under heavy network use
> On Thu, Feb 22, 2007 at 10:34:30PM -0000, Ian Pratt wrote:
>
> > > Not quite sure why the new domain is trying to map 00000000, though.
> >
> > The messages from the save side are expected. Is the message from the
> > restored domain triggered by the restore code, i.e. before the domain is
> > un-paused?
>
> I suspect so, but haven't proved that.

That would be a good test.

> > I expect that if you change the 'pfn=0' in canonicalize_pagetable:539 to
> > 'deadb000' you'll see that propagated through to the restore message. In
> > which case, it's ugly, but benign.
>
> Wouldn't that pfn of 0 be an MFN other than 0, though?

Fair point. Thinking about it, that should be patched up by the next
iteration anyhow.

> I do not see any change when setting pfn as above. Any further ideas?
> I can try adding some back traces. I suppose you're not seeing it with
> a Linux dom0?

I don't think so, but I couldn't swear to it. It did used to come out
during Linux domain boot at one point; I can't remember whether it still
does.

I presume the domain itself seems to suffer no ill effects?

> > > I also see a fair amount of:
> > >
> > > Dom48 freeing in-use page 2991 (pseudophys 100a4): count=2 type=e8000000
> >
> > That's fine. Debug builds are a bit chatty for live migration...
>
> Both of these:
>
> (XEN) mm.c:590:d0 Error getting mfn a005e (pfn 4c35) from L1 entry
> 00000000a005e705 for dom2
> (XEN) mm.c:566:d0 Non-privileged (3) attempt to map I/O space 00000000
>
> are also present in a non-debug build. Would you take a patch to make both
> of them XENLOG_INFO? It's not good that we get console noise during normal
> operation (presuming the I/O space one /is/ normal operation!).

I think we need to understand this one first.

Thanks,
Ian
John Levon
2007-Feb-23 01:09 UTC
Re: [Xen-devel] Live migration fails under heavy network use
On Fri, Feb 23, 2007 at 01:00:34AM -0000, Ian Pratt wrote:

> I don't think so, but I couldn't swear to it. It did used to come out
> during Linux domain boot at one point; I can't remember whether it still
> does.
>
> I presume the domain itself seems to suffer no ill effects?

Right, it seems fine. I'll add some more debug to try and figure out where
it's coming from soon.

thanks
john