Dan Magenheimer
2009-Jul-30 22:05 UTC
[Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
(This is a continuation of http://lists.xensource.com/archives/html/xen-devel/2009-06/msg00795.html )

I'm working on save/restore/migrate for tmem. Due to the way that tmem works, tmem will sometimes have to save/migrate large amounts of (unmapped) data... perhaps gigabytes. As a result, in the case of live migration, "save"ing tmem data cannot wait until the domain has been suspended.

It appears that the "next negative number as marker" mechanism only works for data that trails the last iteration of (mapped) pages, and thus works only after the domain has been suspended. (True?)

I thought about rewriting the format, or waiting for someone else to rewrite it, but save/restore/migrate is the last major functionality missing from tmem, so I decided to deal with the current format as best as possible.

As a result, I have extended the format somewhat to allow for a "negative number as marker" to PREcede the pages of data. Since the first data item in the save file (or migration data stream) is an "unsigned long" representing the number of pages (in the p2m table), a small negative number read as a 32-bit value represents nearly 4G pages, or 16TB of data with 4KB pages. So my change essentially reduces the maximum number of pages to a handful less than 16TB worth of data. This is true for both the 32-bit tools and 64-bit tools. Hopefully, the fragile save/restore/migrate system will be completely rewritten before Xen needs to support more than 16TB per domain.

Other than this limit, I think the extension is backwards compatible. It's ugly... but really not much worse than the existing format.

Patch fragments below... feedback appreciated. It's made uglier by the fact that it needs to handle both ILP32 and I32/LP64. (Ignore the DPRINTFs.) Basically, grab the "first int"... if it matches the marker, do tmem stuff. If not, and the tools are I32/LP64, grab the second part and reconstruct the unsigned long. Otherwise, assign the "first int" to the unsigned long.
Thanks,
Dan

diff -r 5333e6497af6 tools/libxc/xc_domain_restore.c
--- a/tools/libxc/xc_domain_restore.c Mon Jul 20 15:45:50 2009 +0100
+++ b/tools/libxc/xc_domain_restore.c Thu Jul 30 15:25:38 2009 -0600
@@ -367,15 +367,52 @@ int xc_domain_restore(int xc_handle, int
     /* Buffer for holding HVM context */
     uint8_t *hvm_buf = NULL;
+    int first_int = 0;
+
     /* For info only */
     nr_pfns = 0;
-    if ( read_exact(io_fd, &p2m_size, sizeof(unsigned long)) )
+
+    if ( read_exact(io_fd, &first_int, sizeof(int)) )
     {
         ERROR("read: p2m_size");
         goto out;
     }
-    DPRINTF("xc_domain_restore start: p2m_size = %lx\n", p2m_size);
+    if ( first_int == -5 )
+    {
+        DPRINTF("xc_domain_restore start tmem\n");
+DPRINTF("xc_tmem_restore called: xc=%d, dom=%d, io_fd=%d\n", xc_handle,dom,io_fd);
+        if ( xc_tmem_restore(xc_handle, dom, io_fd) )
+        {
+DPRINTF("xc_tmem_restore failed\n");
+            ERROR("error reading/restoring tmem");
+            goto out;
+        }
+DPRINTF("xc_tmem_restore succeeded\n");
+        if ( read_exact(io_fd, &p2m_size, sizeof(long)) )
+        {
+            ERROR("read: p2m_size");
+            goto out;
+        }
+    }
+    else
+#ifdef __X86_64__
+    {
+        int next_int = 0;
+
+        if ( read_exact(io_fd, &next_int, sizeof(int)) )
+        {
+            ERROR("read: p2m_size");
+            goto out;
+        }
+        p2m_size = (next_int << (sizeof(int) * 8)) | first_int;
+    }
+#else
+        p2m_size = first_int;
+#endif
+
+    DPRINTF("xc_domain_restore start memory: p2m_size = %lx\n", p2m_size);
+
     if ( !get_platform_info(xc_handle, dom,
                             &max_mfn, &hvirt_start, &pt_levels, &guest_width) )
@@ -533,6 +570,16 @@ int xc_domain_restore(int xc_handle, int
             }
             xc_set_hvm_param(xc_handle, dom, HVM_PARAM_VM86_TSS, vm86_tss);
+            continue;
+        }
+
+        if ( j == -6 )
+        {
+            if ( xc_tmem_restore_extra(xc_handle, dom, io_fd) )
+            {
+                ERROR("error reading/restoring tmem extra");
+                goto out;
+            }
             continue;
         }
diff -r 5333e6497af6 tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c Mon Jul 20 15:45:50 2009 +0100
+++ b/tools/libxc/xc_domain_save.c Thu Jul 30 15:25:38 2009 -0600
@@ -758,6 +758,7 @@ int xc_domain_save(int xc_handle, int io
     int live = (flags & XCFLAGS_LIVE);
     int debug = (flags & XCFLAGS_DEBUG);
     int race = 0, sent_last_iter, skip_this_iter;
+    int tmem_saved = 0;
     /* The new domain's shared-info frame number. */
     unsigned long shared_info_frame;
@@ -880,6 +881,13 @@ int xc_domain_save(int xc_handle, int io
             ERROR("Domain appears not to have suspended");
             goto out;
         }
+    }
+
+    tmem_saved = xc_tmem_save(xc_handle, dom, io_fd, live, -5);
+    if ( tmem_saved == -1 )
+    {
+        ERROR("Error when writing to state file (tmem)");
+        goto out;
     }
     last_iter = !live;
@@ -1600,10 +1608,22 @@ int xc_domain_save(int xc_handle, int io
         goto out;
     }
+    if ( tmem_saved > 0 && live )
+    {
+        if ( xc_tmem_save_extra(xc_handle, dom, io_fd, -6) == -1 )
+        {
+            ERROR("Error when writing to state file (tmem)");
+            goto out;
+        }
+    }
+
     /* Success! */
     rc = 0;
  out:
+
+    if ( tmem_saved != 0 && live )
+        xc_tmem_save_done(xc_handle, dom);
     if ( live )
     {

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
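Editorial aside on the "grab the first int, then reconstruct the unsigned long" step described in the message above. The posted expression shifts an int by the full width of int, which C leaves undefined, and it ORs in first_int with sign extension, so a low half with its top bit set would smear ones into the upper half. The uppercase __X86_64__ guard is also worth double-checking, since GCC normally predefines the lowercase __x86_64__. What follows is only a minimal sketch of a safer reconstruction, not the code that went into libxc. The names TMEM_MARKER, is_tmem_marker and join_words are hypothetical, and the low-word-first order assumes a little-endian stream as on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker value, mirroring the -5 used in the patch. */
#define TMEM_MARKER (-5)

/* Returns 1 if the first 32-bit word of the stream is the tmem marker. */
int is_tmem_marker(int32_t first_word)
{
    return first_word == TMEM_MARKER;
}

/*
 * Rebuild a 64-bit page count from two 32-bit words read from the stream.
 * Assumption: the low word arrives first (little-endian writer).
 * Both halves are zero-extended before the shift, so there is no shift of
 * a 32-bit int by 32 bits and no sign extension of the low half.
 */
uint64_t join_words(int32_t first_word, int32_t next_word)
{
    uint64_t lo = (uint32_t)first_word;
    uint64_t hi = (uint32_t)next_word;

    return (hi << 32) | lo;
}

/* Small usage check of the two helpers above. */
void join_words_selfcheck(void)
{
    assert(is_tmem_marker(-5));
    assert(!is_tmem_marker(1024));
    /* Low half with the top bit set must not leak into the high half. */
    assert(join_words(INT32_MIN, 0) == 0x80000000ULL);
}
```

With this scheme the marker still collides with a genuine page count whose low 32 bits equal 0xFFFFFFFB, which is the limitation the message describes as "a handful less than 16TB".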
Keir Fraser
2009-Jul-30 22:42 UTC
Re: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
On 30/07/2009 23:05, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> It appears that the "next negative number as marker"
> mechanism only works for data that trails the last
> iteration of (mapped) pages, and thus works only after
> the domain has been suspended. (True?)

No. It so happens that all such markers are emitted after saved pages right now, but you can see in xc_domain_restore.c that markers are detected right at the top of the read pages loop. You could add a new marker, emit it before any pages in xc_domain_save.c, and pick it up just fine in the restore loop. You'd read the rest of your stuff and act on it, then do a C 'continue' to kick off the next loop iteration, which would presumably start reading ordinary saved memory pages.

Your patch is definitely not required.

-- Keir
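Editorial aside: the loop shape described in the reply above (check for a marker at the top of each iteration, consume the marker's payload, then 'continue') can be sketched as follows. This is a toy model over an in-memory array of 32-bit words, not the real xc_domain_restore.c loop. The record layout, the value -7 and the names DEMO_TMEM_MARKER, demo_totals and demo_restore are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker value; real marker numbers are defined by libxc. */
#define DEMO_TMEM_MARKER (-7)

struct demo_totals {
    size_t pages;        /* page entries seen in ordinary batches */
    size_t extra_words;  /* payload words seen behind the marker  */
};

/*
 * Toy stream layout (an assumption for this sketch):
 *   marker record: DEMO_TMEM_MARKER, len, then len payload words
 *   page batch:    count > 0, then count page words
 *   terminator:    0
 * Returns 0 on success, -1 on a truncated stream or an unknown marker.
 */
int demo_restore(const int32_t *s, size_t n, struct demo_totals *t)
{
    size_t pos = 0;

    assert(s != NULL && t != NULL);
    t->pages = 0;
    t->extra_words = 0;

    for ( ; ; )
    {
        int32_t j;

        if ( pos >= n )
            return -1;
        j = s[pos++];

        if ( j == 0 )
            return 0;               /* end of stream */

        if ( j == DEMO_TMEM_MARKER )
        {
            int32_t len;

            if ( pos >= n )
                return -1;
            len = s[pos++];
            if ( len < 0 || (size_t)len > n - pos )
                return -1;
            pos += (size_t)len;     /* "read the rest of your stuff" */
            t->extra_words += (size_t)len;
            continue;               /* next iteration reads pages again */
        }

        if ( j < 0 )
            return -1;              /* marker this sketch does not know */

        if ( (size_t)j > n - pos )
            return -1;
        pos += (size_t)j;           /* ordinary batch of j pages */
        t->pages += (size_t)j;
    }
}
```

Because the marker check sits at the top of the loop, a marker record parses the same way whether it appears before the first batch, between batches, or after the last one.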
Dan Magenheimer
2009-Jul-30 22:59 UTC
RE: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
Well for PV guests, the p2m table is definitely assumed to be first. Are you saying, I can/should put the tmem stuff between the p2m table and the mapped data pages? If so, cool, I will give that a try.

Thanks!
Dan

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Thursday, July 30, 2009 4:42 PM
> To: Dan Magenheimer; Xen-Devel (E-mail)
> Cc: Tim Deegan; Gianluca Guida; Stefano Stabellini; John Levon;
> Stefano@acsinet11.oracle.com
> Subject: Re: [Xen-devel] [RFC] save image file format CHANGE
> (minor, but feedback appreciated)
>
> On 30/07/2009 23:05, "Dan Magenheimer"
> <dan.magenheimer@oracle.com> wrote:
>
> > It appears that the "next negative number as marker"
> > mechanism only works for data that trails the last
> > iteration of (mapped) pages, and thus works only after
> > the domain has been suspended. (True?)
>
> No. It so happens that all such markers are emitted after
> saved pages right now, but you can see in xc_domain_restore.c
> that markers are detected right at the top of the read pages
> loop. You could add a new marker, emit it before any pages in
> xc_domain_save.c, and pick it up just fine in the restore loop.
> You'd read the rest of your stuff and act on it, then do a C
> 'continue' to kick off the next loop iteration, which would
> presumably start reading ordinary saved memory pages.
>
> Your patch is definitely not required.
>
> -- Keir
Keir Fraser
2009-Jul-31 07:20 UTC
Re: [Xen-devel] [RFC] save image file format CHANGE (minor, but feedback appreciated)
Oh yes, it's harder to go earlier than the p2m data. But going before the (possibly multiple rounds of) data pages is easy.

-- Keir

On 30/07/2009 23:59, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Well for PV guests, the p2m table is definitely assumed
> to be first. Are you saying, I can/should put the tmem
> stuff between the p2m table and the mapped data pages?
> If so, cool, I will give that a try. Thanks!
>
> Dan
>
>> -----Original Message-----
>> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>> Sent: Thursday, July 30, 2009 4:42 PM
>> To: Dan Magenheimer; Xen-Devel (E-mail)
>> Cc: Tim Deegan; Gianluca Guida; Stefano Stabellini; John Levon;
>> Stefano@acsinet11.oracle.com
>> Subject: Re: [Xen-devel] [RFC] save image file format CHANGE
>> (minor, but feedback appreciated)
>>
>> On 30/07/2009 23:05, "Dan Magenheimer"
>> <dan.magenheimer@oracle.com> wrote:
>>
>>> It appears that the "next negative number as marker"
>>> mechanism only works for data that trails the last
>>> iteration of (mapped) pages, and thus works only after
>>> the domain has been suspended. (True?)
>>
>> No. It so happens that all such markers are emitted after
>> saved pages right now, but you can see in xc_domain_restore.c
>> that markers are detected right at the top of the read pages
>> loop. You could add a new marker, emit it before any pages in
>> xc_domain_save.c, and pick it up just fine in the restore loop.
>> You'd read the rest of your stuff and act on it, then do a C
>> 'continue' to kick off the next loop iteration, which would
>> presumably start reading ordinary saved memory pages.
>>
>> Your patch is definitely not required.
>>
>> -- Keir
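Editorial aside: the ordering agreed on in this exchange (p2m data first, then the new marker and its payload, then the possibly multiple rounds of data pages) can be sketched from the save side as below. This is a toy writer into an in-memory buffer under assumed conventions, not xc_domain_save.c. The marker value -7, the record layout and the name demo_save are hypothetical, and the p2m data is reduced to a single size word for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker value; real marker numbers are defined by libxc. */
#define DEMO_SAVE_MARKER (-7)

/*
 * Writes, in order:
 *   p2m_size                                  (stands in for the p2m data)
 *   DEMO_SAVE_MARKER, extra_len, extra words  (the early marker record)
 *   rounds x (pages_per_round, page words)    (ordinary page batches)
 *   0                                         (terminator)
 * Returns the number of words written, or 0 if the arguments are invalid
 * or the buffer is too small.
 */
size_t demo_save(int32_t *out, size_t cap, int32_t p2m_size,
                 const int32_t *extra, int32_t extra_len,
                 int32_t rounds, int32_t pages_per_round)
{
    size_t need, pos = 0;
    int32_t r, i;

    assert(out != NULL);
    if ( extra_len < 0 || rounds < 0 || pages_per_round <= 0 )
        return 0;
    if ( extra_len > 0 && extra == NULL )
        return 0;

    need = 1 + 2 + (size_t)extra_len
         + (size_t)rounds * (1 + (size_t)pages_per_round) + 1;
    if ( need > cap )
        return 0;

    out[pos++] = p2m_size;              /* p2m data stays first */

    out[pos++] = DEMO_SAVE_MARKER;      /* marker precedes all page data */
    out[pos++] = extra_len;
    for ( i = 0; i < extra_len; i++ )
        out[pos++] = extra[i];

    for ( r = 0; r < rounds; r++ )      /* possibly multiple rounds */
    {
        out[pos++] = pages_per_round;
        for ( i = 0; i < pages_per_round; i++ )
            out[pos++] = i;             /* dummy page contents */
    }

    out[pos++] = 0;                     /* terminator */
    return pos;
}
```

In this layout the first word of the stream keeps its existing meaning, so a reader never has to distinguish a marker from a page count at that position.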