Dan Magenheimer
2009-Jul-30 22:05 UTC
[Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
(This is a continuation of
http://lists.xensource.com/archives/html/xen-devel/2009-06/msg00795.html )
I'm working on save/restore/migrate for tmem. Due to the
way that tmem works, tmem will sometimes have to save/migrate
large amounts of (unmapped) data... perhaps gigabytes. As a
result, in the case of live migration, "save"ing tmem
data cannot wait until the domain has been suspended.
It appears that the "next negative number as marker"
mechanism only works for data that trails the last
iteration of (mapped) pages, and thus works only after
the domain has been suspended. (True?)
I thought about rewriting the format, or waiting for
someone else to rewrite it, but save/restore/migrate
is the last major functionality missing from tmem,
so I decided to deal with the current format as
best as possible.
As a result, I have extended the format somewhat to
allow for a "negative number as marker" to PREcede
the pages of data. Since the first data item in
the save file (or migration data stream) is an
"unsigned long" representing the number of pages
(in the p2m table), a small negative number represents
nearly 4G pages, or 16TB of data (with 4K pages). So my change
essentially reduces the maximum number of pages to a handful
less than 16TB worth of data. This is true for both the 32-bit
tools and 64-bit tools. Hopefully, the fragile
save/restore/migrate system will be completely rewritten
before Xen needs to support more than 16TB per
domain. Other than this limit, I think the extension
is backwards compatible. It's ugly... but really not
much worse than the existing format.
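As a rough check of the arithmetic above, here is a minimal sketch. It assumes 4 KiB pages and a 32-bit count field; the helper name and the TOY_PAGE_SHIFT constant are hypothetical and are not part of libxc.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86. */
#define TOY_PAGE_SHIFT 12

/* Hypothetical helper (not part of libxc): the amount of memory, in bytes,
 * that a negative marker would describe if its 32-bit pattern were read
 * as an unsigned page count. */
static uint64_t bytes_aliased_by_marker(int32_t marker)
{
    assert(marker < 0);                  /* markers are negative */
    uint64_t pages = (uint32_t)marker;   /* e.g. -5 becomes 0xFFFFFFFB */
    return pages << TOY_PAGE_SHIFT;
}
```

For the marker -5 this yields 2^44 bytes minus five pages, that is, five pages short of 16 TiB.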
Patch fragments below... feedback appreciated. It's made
uglier by the fact that it needs to handle both ILP32
and I32/LP64. (Ignore the DPRINTFs.) Basically,
grab the "first int"... if it matches the marker,
do tmem stuff. If not, if I32/LP64, grab the second
part and reconstruct the unsigned long. Else
assign the "first int" to the unsigned long.
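The decoding steps described above can be sketched as standalone C. This is only an illustration under stated assumptions: the stream is little-endian (low half of the unsigned long first), the marker value is -5, and decode_header, header_kind and the in-memory buffer are hypothetical stand-ins for the real read_exact-based code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TMEM_MARKER (-5)   /* assumed marker value */

enum header_kind { HEADER_TMEM, HEADER_P2M_SIZE };

/* Hypothetical helper, not part of libxc. long_size is sizeof(unsigned long)
 * on the saving side: 4 for ILP32, 8 for I32/LP64. */
static enum header_kind decode_header(const unsigned char *buf,
                                      size_t long_size,
                                      uint64_t *p2m_size,
                                      size_t *consumed)
{
    int32_t first;
    uint32_t second = 0;

    assert(long_size == 4 || long_size == 8);

    /* Grab the "first int". */
    memcpy(&first, buf, sizeof(first));
    *consumed = sizeof(first);

    /* Marker seen: tmem data follows, p2m_size is read only after it. */
    if (first == TMEM_MARKER)
        return HEADER_TMEM;

    /* I32/LP64: grab the second half of the unsigned long. */
    if (long_size == 8) {
        memcpy(&second, buf + sizeof(first), sizeof(second));
        *consumed += sizeof(second);
    }

    /* Widen before shifting and keep the low half unsigned, so a low half
     * with its top bit set is not sign-extended into the high half. */
    *p2m_size = ((uint64_t)second << 32) | (uint32_t)first;
    return HEADER_P2M_SIZE;
}
```

Note that the sketch widens to 64 bits before shifting; shifting a plain int by its full width is undefined in C.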
Thanks,
Dan
diff -r 5333e6497af6 tools/libxc/xc_domain_restore.c
--- a/tools/libxc/xc_domain_restore.c Mon Jul 20 15:45:50 2009 +0100
+++ b/tools/libxc/xc_domain_restore.c Thu Jul 30 15:25:38 2009 -0600
@@ -367,15 +367,52 @@ int xc_domain_restore(int xc_handle, int
     /* Buffer for holding HVM context */
     uint8_t *hvm_buf = NULL;
+    int first_int = 0;
+
     /* For info only */
     nr_pfns = 0;
-    if ( read_exact(io_fd, &p2m_size, sizeof(unsigned long)) )
+
+    if ( read_exact(io_fd, &first_int, sizeof(int)) )
     {
         ERROR("read: p2m_size");
         goto out;
     }
-    DPRINTF("xc_domain_restore start: p2m_size = %lx\n", p2m_size);
+    if ( first_int == -5 )
+    {
+        DPRINTF("xc_domain_restore start tmem\n");
+DPRINTF("xc_tmem_restore called: xc=%d, dom=%d, io_fd=%d\n", xc_handle,dom,io_fd);
+        if ( xc_tmem_restore(xc_handle, dom, io_fd) )
+        {
+DPRINTF("xc_tmem_restore failed\n");
+            ERROR("error reading/restoring tmem");
+            goto out;
+        }
+DPRINTF("xc_tmem_restore succeeded\n");
+        if ( read_exact(io_fd, &p2m_size, sizeof(long)) )
+        {
+            ERROR("read: p2m_size");
+            goto out;
+        }
+    }
+    else
+#ifdef __x86_64__
+    {
+        int next_int = 0;
+
+        if ( read_exact(io_fd, &next_int, sizeof(int)) )
+        {
+            ERROR("read: p2m_size");
+            goto out;
+        }
+        /* Widen before shifting; keep the low half unsigned. */
+        p2m_size = ((unsigned long)(unsigned int)next_int << (sizeof(int) * 8)) |
+                   (unsigned int)first_int;
+    }
+#else
+        p2m_size = first_int;
+#endif
+
+    DPRINTF("xc_domain_restore start memory: p2m_size = %lx\n", p2m_size);
+
     if ( !get_platform_info(xc_handle, dom,
                             &max_mfn, &hvirt_start, &pt_levels,
                             &guest_width) )
@@ -533,6 +570,16 @@ int xc_domain_restore(int xc_handle, int
             }
             xc_set_hvm_param(xc_handle, dom, HVM_PARAM_VM86_TSS, vm86_tss);
+            continue;
+        }
+
+        if ( j == -6 )
+        {
+            if ( xc_tmem_restore_extra(xc_handle, dom, io_fd) )
+            {
+                ERROR("error reading/restoring tmem extra");
+                goto out;
+            }
             continue;
         }
diff -r 5333e6497af6 tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c Mon Jul 20 15:45:50 2009 +0100
+++ b/tools/libxc/xc_domain_save.c Thu Jul 30 15:25:38 2009 -0600
@@ -758,6 +758,7 @@ int xc_domain_save(int xc_handle, int io
     int live = (flags & XCFLAGS_LIVE);
     int debug = (flags & XCFLAGS_DEBUG);
     int race = 0, sent_last_iter, skip_this_iter;
+    int tmem_saved = 0;
     /* The new domain's shared-info frame number. */
     unsigned long shared_info_frame;
@@ -880,6 +881,13 @@ int xc_domain_save(int xc_handle, int io
             ERROR("Domain appears not to have suspended");
             goto out;
         }
+    }
+
+    tmem_saved = xc_tmem_save(xc_handle, dom, io_fd, live, -5);
+    if ( tmem_saved == -1 )
+    {
+        ERROR("Error when writing to state file (tmem)");
+        goto out;
     }
     last_iter = !live;
@@ -1600,10 +1608,22 @@ int xc_domain_save(int xc_handle, int io
         goto out;
     }
+    if ( tmem_saved > 0 && live )
+    {
+        if ( xc_tmem_save_extra(xc_handle, dom, io_fd, -6) == -1 )
+        {
+            ERROR("Error when writing to state file (tmem)");
+            goto out;
+        }
+    }
+
     /* Success! */
     rc = 0;
  out:
+
+    if ( tmem_saved != 0 && live )
+        xc_tmem_save_done(xc_handle, dom);
     if ( live )
     {
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2009-Jul-30 22:42 UTC
Re: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
On 30/07/2009 23:05, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> It appears that the "next negative number as marker"
> mechanism only works for data that trails the last
> iteration of (mapped) pages, and thus works only after
> the domain has been suspended. (True?)

No. It so happens that all such markers are emitted after saved pages right
now, but you can see in xc_domain_restore.c that markers are detected right
at the top of the read pages loop. You could add a new marker, emit it
before any pages in xc_domain_save.c, and pick it up just fine in the
restore loop. You'd read the rest of your stuff and act on it, then do a C
'continue' to kick off the next loop iteration, which would presumably start
reading ordinary saved memory pages.

Your patch is definitely not required.

-- Keir
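The marker handling described in this reply can be sketched as a toy restore loop. This is not the libxc code: the in-memory stream, the length-prefixed tmem record, the 8-byte toy page and every name below are assumptions made for illustration. Only the overall shape follows the description: read a count at the top of the loop, treat a negative value as a marker, act on it, and 'continue'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MARKER_TMEM   (-5)  /* assumed marker value */
#define TOY_PAGE_SIZE 8     /* tiny stand-in for a real page */

struct stream { const unsigned char *p; size_t left; };
struct result { int tmem_bytes; int pages; };

/* Copy exactly n bytes out of the stream; -1 if it is too short. */
static int read_exact_mem(struct stream *s, void *dst, size_t n)
{
    if (s->left < n) return -1;
    memcpy(dst, s->p, n);
    s->p += n; s->left -= n;
    return 0;
}

/* Discard exactly n bytes of payload; -1 if the stream is too short. */
static int skip_mem(struct stream *s, size_t n)
{
    if (s->left < n) return -1;
    s->p += n; s->left -= n;
    return 0;
}

static int restore_loop(struct stream *s, struct result *r)
{
    assert(s != NULL && r != NULL);
    for (;;) {
        int32_t j;

        /* Top of the loop: the next int is a batch size or a marker. */
        if (read_exact_mem(s, &j, sizeof(j))) return -1;

        if (j == MARKER_TMEM) {
            int32_t len;   /* assumed: tmem payload is length-prefixed */
            if (read_exact_mem(s, &len, sizeof(len)) || len < 0 ||
                skip_mem(s, (size_t)len))
                return -1;
            r->tmem_bytes += len;
            continue;      /* next iteration reads ordinary pages */
        }
        if (j == 0) return 0;    /* zero batch size ends the stream */
        if (j < 0) return -1;    /* unknown marker */

        /* Ordinary batch of j pages. */
        if (skip_mem(s, (size_t)j * TOY_PAGE_SIZE)) return -1;
        r->pages += j;
    }
}
```

Because the marker check sits at the top of the loop, the same code accepts a tmem record before, between, or after page batches.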
Dan Magenheimer
2009-Jul-30 22:59 UTC
RE: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
Well for PV guests, the p2m table is definitely assumed
to be first. Are you saying, I can/should put the tmem
stuff between the p2m table and the mapped data pages?
If so, cool, I will give that a try. Thanks!

Dan
Keir Fraser
2009-Jul-31 07:20 UTC
Re: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
Oh yes, it's harder to go earlier than the p2m data. But going before the
(possibly multiple rounds of) data pages is easy.

-- Keir
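The ordering agreed on here (p2m data first, then the tmem record, then the rounds of data pages) can be sketched from the save side. This is a toy writer under loud assumptions: a fixed-size in-memory buffer instead of io_fd, a 32-bit p2m size with the p2m frame list omitted, a length-prefixed tmem record, and one 4-byte toy page per round. None of the names below exist in libxc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MARKER_TMEM (-5)   /* assumed marker value */

struct out { unsigned char buf[256]; size_t len; };

/* Append n bytes to the output buffer; -1 if it would overflow. */
static int put(struct out *o, const void *src, size_t n)
{
    if (n > sizeof(o->buf) - o->len) return -1;
    memcpy(o->buf + o->len, src, n);
    o->len += n;
    return 0;
}

static int put_i32(struct out *o, int32_t v)
{
    return put(o, &v, sizeof(v));
}

/* Hypothetical toy writer showing only the record order. */
static int save_toy_image(struct out *o, uint32_t p2m_size,
                          const void *tmem, int32_t tmem_len,
                          int32_t rounds)
{
    assert(tmem_len >= 0 && rounds >= 0);

    /* 1. p2m size stays first (the p2m frame list is omitted here). */
    if (put(o, &p2m_size, sizeof(p2m_size))) return -1;

    /* 2. tmem record: marker, length, payload, before any data pages. */
    if (tmem_len > 0 &&
        (put_i32(o, MARKER_TMEM) || put_i32(o, tmem_len) ||
         put(o, tmem, (size_t)tmem_len)))
        return -1;

    /* 3. One batch per round: batch size 1, then one 4-byte toy page. */
    for (int32_t i = 0; i < rounds; i++) {
        int32_t page = i;
        if (put_i32(o, 1) || put(o, &page, sizeof(page))) return -1;
    }

    /* 4. A zero batch size terminates the page stream. */
    return put_i32(o, 0);
}
```

Placing the tmem record after the p2m size leaves the first field of the image untouched, so a reader that knows nothing about tmem still finds the p2m size where it expects it.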