Liu, Jinsong
2012-Nov-22 16:28 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Sorry all, please ignore the last 2 emails; I mis-operated hg email :( I'm checking the reason and will send them out again later.

Thanks,
Jinsong
Liu Jinsong
2012-Nov-23 00:25 UTC
[PATCH V4] X86/vMCE: handle broken page with regard to migration
This patch handles broken pages with regard to migration. Generally there
are the following cases:
 1. a page breaks before migration
 2. a page breaks during migration
    2.1 a page breaks during migration, but not at the last iteration
    2.2 a page breaks at the last iteration of migration

For case 1, at the sender the broken page will be mapped but not copied to
the target (otherwise it may trigger a more serious error, say, an SRAR
error), while its pfn_type and pfn number will be transferred to the target
so that the target can take appropriate action.

For case 2.1, at the sender the mce handler marks the broken page in the
dirty bitmap, so that at the next iteration its pfn_type and pfn number
will be transferred to the target, which then takes appropriate action.

For case 2.2, the sender adds a check to see whether a vMCE occurred at the
last iteration. If so, it does one or more further iterations so that the
broken page's pfn_type and pfn number still get transferred to the target.
Another point: if the guest is being saved to disk and a vMCE occurs during
the save, the extra iteration(s) are needed as well.

For all cases at the target (if migration is not aborted by the vMCE):
the target populates pages for the guest. For a broken page we prefer to
keep the type of the page, for the sake of seamless migration. The target
sets its p2m type to p2m_ram_broken; the guest's MCE handling has a good
chance of dealing with the broken page, while if the guest accesses the
broken page again it will kill itself, as expected.

All of the above assumes that migration will succeed. However, for case 2
there are scenarios that may result in a guest or hypervisor crash: if the
pfn_type query fails to observe p2m_ram_broken (the vMCE occurs after the
query) and the broken page is then read, the guest/hypervisor may survive
or crash, depending on the nature of the error and how the guest/hypervisor
handles it. If the guest/hypervisor survives, migration is fine, since the
pfn_type will be transferred to the target at the next iteration, preventing
further harm at the target. If the guest/hypervisor crashes, it clearly no
longer cares about the migration. Unfortunately we have no way to predict
the outcome, so the best we can do is handle it as described; after all, we
cannot forbid migration for fear that it may crash the guest/hypervisor.

Patch version history:
V4:
 - adjust variables and patch description based on feedback
V3:
 - handle pages broken at the last iteration
V2:
 - continue migration when a page breaks during migration, by marking
   the broken page in the dirty bitmap
V1:
 - abort migration when a page breaks during migration
 - transfer pfn_type to the target for pages broken before migration

Suggested-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

- for hypervisor part of the patch
Acked-by: Jan Beulich <JBeulich@suse.com>

diff -r 8b93ac0c93f3 -r d3378692eece tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c   Tue Nov 13 11:19:17 2012 +0000
+++ b/tools/libxc/xc_domain.c   Fri Nov 23 08:21:57 2012 +0800
@@ -283,6 +283,22 @@
     return ret;
 }

+/* set broken page p2m */
+int xc_set_broken_page_p2m(xc_interface *xch,
+                           uint32_t domid,
+                           unsigned long pfn)
+{
+    int ret;
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_set_broken_page_p2m;
+    domctl.domain = (domid_t)domid;
+    domctl.u.set_broken_page_p2m.pfn = pfn;
+    ret = do_domctl(xch, &domctl);
+
+    return ret ? -1 : 0;
+}
+
 /* get info from hvm guest for save */
 int xc_domain_hvm_getcontext(xc_interface *xch,
                              uint32_t domid,
diff -r 8b93ac0c93f3 -r d3378692eece tools/libxc/xc_domain_restore.c
--- a/tools/libxc/xc_domain_restore.c   Tue Nov 13 11:19:17 2012 +0000
+++ b/tools/libxc/xc_domain_restore.c   Fri Nov 23 08:21:57 2012 +0800
@@ -1023,9 +1023,15 @@

     countpages = count;
     for (i = oldcount; i < buf->nr_pages; ++i)
-        if ((buf->pfn_types[i] & XEN_DOMCTL_PFINFO_LTAB_MASK) == XEN_DOMCTL_PFINFO_XTAB
-            ||(buf->pfn_types[i] & XEN_DOMCTL_PFINFO_LTAB_MASK) == XEN_DOMCTL_PFINFO_XALLOC)
+    {
+        unsigned long pagetype;
+
+        pagetype = buf->pfn_types[i] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+        if ( pagetype == XEN_DOMCTL_PFINFO_XTAB ||
+             pagetype == XEN_DOMCTL_PFINFO_BROKEN ||
+             pagetype == XEN_DOMCTL_PFINFO_XALLOC )
             --countpages;
+    }

     if (!countpages)
         return count;
@@ -1267,6 +1273,17 @@
             /* a bogus/unmapped/allocate-only page: skip it */
             continue;

+        if ( pagetype == XEN_DOMCTL_PFINFO_BROKEN )
+        {
+            if ( xc_set_broken_page_p2m(xch, dom, pfn) )
+            {
+                ERROR("Set p2m for broken page failed, "
+                      "dom=%d, pfn=%lx\n", dom, pfn);
+                goto err_mapped;
+            }
+            continue;
+        }
+
         if (pfn_err[i])
         {
             ERROR("unexpected PFN mapping failure pfn %lx map_mfn %lx p2m_mfn %lx",
diff -r 8b93ac0c93f3 -r d3378692eece tools/libxc/xc_domain_save.c
--- a/tools/libxc/xc_domain_save.c   Tue Nov 13 11:19:17 2012 +0000
+++ b/tools/libxc/xc_domain_save.c   Fri Nov 23 08:21:57 2012 +0800
@@ -1118,7 +1118,7 @@
     /* Now write out each data page, canonicalising page tables as we go... */
     for ( ; ; )
     {
-        unsigned int N, batch, run;
+        unsigned int N, batch, run, broken_before_send, broken_after_send;
         char reportbuf[80];

         snprintf(reportbuf, sizeof(reportbuf),
@@ -1270,6 +1270,7 @@
                 goto out;
             }

+            broken_before_send = 0;
             for ( run = j = 0; j < batch; j++ )
             {
                 unsigned long gmfn = pfn_batch[j];
@@ -1277,6 +1278,14 @@
                 if ( !hvm )
                     gmfn = pfn_to_mfn(gmfn);

+                if ( pfn_type[j] == XEN_DOMCTL_PFINFO_BROKEN )
+                {
+                    pfn_type[j] |= pfn_batch[j];
+                    ++broken_before_send;
+                    ++run;
+                    continue;
+                }
+
                 if ( pfn_err[j] )
                 {
                     if ( pfn_type[j] == XEN_DOMCTL_PFINFO_XTAB )
@@ -1371,8 +1380,12 @@
                     }
                 }

-                /* skip pages that aren't present or are alloc-only */
+                /*
+                 * skip pages that aren't present,
+                 * or are broken, or are alloc-only
+                 */
                 if ( pagetype == XEN_DOMCTL_PFINFO_XTAB
+                    || pagetype == XEN_DOMCTL_PFINFO_BROKEN
                     || pagetype == XEN_DOMCTL_PFINFO_XALLOC )
                     continue;

@@ -1484,6 +1497,38 @@

             munmap(region_base, batch*PAGE_SIZE);

+            /*
+             * At the last iter, count the number of broken pages after sending,
+             * and if there are more than before sending, do one or more iter
+             * to make sure the pages are marked broken on the receiving side.
+             */
+            if ( last_iter )
+            {
+                /* Read pfn type again */
+                for ( j = 0; j < batch; j++ )
+                {
+                    if ( hvm )
+                        pfn_type[j] = pfn_batch[j];
+                    else
+                        pfn_type[j] = pfn_to_mfn(pfn_batch[j]);
+                }
+
+                if ( xc_get_pfn_type_batch(xch, dom, batch, pfn_type) )
+                {
+                    PERROR("get_pfn_type_batch failed");
+                    goto out;
+                }
+
+                /* Count the number of broken pages */
+                broken_after_send = 0;
+                for ( j = 0; j < batch; j++ )
+                    if ( pfn_type[j] == XEN_DOMCTL_PFINFO_BROKEN )
+                        broken_after_send++;
+
+                /* If there are more than before sending, do more iter(s) */
+                if ( broken_before_send < broken_after_send )
+                    last_iter = 0;
+            }
         } /* end of this while loop for this iteration */

       skip:
@@ -1550,23 +1595,22 @@
                     PERROR("Error when writing to state file (tsc)");
                     goto out;
                 }
+            }
+        }
+        else
+            last_iter = 1;

+        if ( xc_shadow_control(xch, dom,
+                               XEN_DOMCTL_SHADOW_OP_CLEAN, HYPERCALL_BUFFER(to_send),
+                               dinfo->p2m_size, NULL, 0, &shadow_stats) != dinfo->p2m_size )
+        {
+            PERROR("Error flushing shadow PT");
+            goto out;
+        }

-            }
+        sent_last_iter = sent_this_iter;

-            if ( xc_shadow_control(xch, dom,
-                                   XEN_DOMCTL_SHADOW_OP_CLEAN, HYPERCALL_BUFFER(to_send),
-                                   dinfo->p2m_size, NULL, 0, &shadow_stats) != dinfo->p2m_size )
-            {
-                PERROR("Error flushing shadow PT");
-                goto out;
-            }
-
-            sent_last_iter = sent_this_iter;
-
-            print_stats(xch, dom, sent_this_iter, &time_stats, &shadow_stats, 1);
-
-        }
+        print_stats(xch, dom, sent_this_iter, &time_stats, &shadow_stats, 1);
     } /* end of infinite for loop */

     DPRINTF("All memory is saved\n");
diff -r 8b93ac0c93f3 -r d3378692eece tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h   Tue Nov 13 11:19:17 2012 +0000
+++ b/tools/libxc/xenctrl.h   Fri Nov 23 08:21:57 2012 +0800
@@ -575,6 +575,17 @@
                           xc_domaininfo_t *info);

 /**
+ * This function sets the p2m type for a broken page
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid the domain id which the broken page belongs to
+ * @parm pfn the pfn number of the broken page
+ * @return 0 on success, -1 on failure
+ */
+int xc_set_broken_page_p2m(xc_interface *xch,
+                           uint32_t domid,
+                           unsigned long pfn);
+
+/**
  * This function returns information about the context of a hvm domain
  * @parm xch a handle to an open hypervisor interface
  * @parm domid the domain to get information from
diff -r 8b93ac0c93f3 -r d3378692eece xen/arch/x86/cpu/mcheck/mcaction.c
--- a/xen/arch/x86/cpu/mcheck/mcaction.c   Tue Nov 13 11:19:17 2012 +0000
+++ b/xen/arch/x86/cpu/mcheck/mcaction.c   Fri Nov 23 08:21:57 2012 +0800
@@ -1,5 +1,6 @@
 #include <xen/types.h>
 #include <xen/sched.h>
+#include <asm/p2m.h>
 #include "mcaction.h"
 #include "vmce.h"
 #include "mce.h"
@@ -91,6 +92,24 @@
                     goto vmce_failed;
                 }

+                if ( is_hvm_domain(d) && !d->arch.hvm_domain.dirty_vram &&
+                     paging_mode_log_dirty(d) )
+                {
+                    /*
+                     * vMCE occurred during migration
+                     *
+                     * At the sender, mark the broken page in the dirty
+                     * bitmap, so that at the copypages stage of migration
+                     * the broken page's pfn_type and pfn number are
+                     * transferred to the target, which then takes
+                     * appropriate action.
+                     *
+                     * At the target, the p2m type is set to p2m_ram_broken
+                     * for the broken page, so that if the guest accesses
+                     * the broken page again, it kills itself as expected.
+                     */
+                    paging_mark_dirty(d, mfn);
+                }
+
                 if ( unmmap_broken_page(d, _mfn(mfn), gfn) )
                 {
                     printk("Unmap broken memory %lx for DOM%d failed\n",
diff -r 8b93ac0c93f3 -r d3378692eece xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c   Tue Nov 13 11:19:17 2012 +0000
+++ b/xen/arch/x86/domctl.c   Fri Nov 23 08:21:57 2012 +0800
@@ -209,12 +209,18 @@
             for ( j = 0; j < k; j++ )
             {
                 unsigned long type = 0;
+                p2m_type_t t;

-                page = get_page_from_gfn(d, arr[j], NULL, P2M_ALLOC);
+                page = get_page_from_gfn(d, arr[j], &t, P2M_ALLOC);

                 if ( unlikely(!page) ||
                      unlikely(is_xen_heap_page(page)) )
-                    type = XEN_DOMCTL_PFINFO_XTAB;
+                {
+                    if ( p2m_is_broken(t) )
+                        type = XEN_DOMCTL_PFINFO_BROKEN;
+                    else
+                        type = XEN_DOMCTL_PFINFO_XTAB;
+                }
                 else
                 {
                     switch( page->u.inuse.type_info & PGT_type_mask )
@@ -235,6 +241,9 @@

                     if ( page->u.inuse.type_info & PGT_pinned )
                         type |= XEN_DOMCTL_PFINFO_LPINTAB;
+
+                    if ( page->count_info & PGC_broken )
+                        type = XEN_DOMCTL_PFINFO_BROKEN;
                 }

                 if ( page )
@@ -1568,6 +1577,29 @@
     }
     break;

+    case XEN_DOMCTL_set_broken_page_p2m:
+    {
+        struct domain *d;
+
+        d = rcu_lock_domain_by_id(domctl->domain);
+        if ( d != NULL )
+        {
+            p2m_type_t pt;
+            unsigned long pfn = domctl->u.set_broken_page_p2m.pfn;
+            mfn_t mfn = get_gfn_query(d, pfn, &pt);
+
+            if ( unlikely(!mfn_valid(mfn_x(mfn)) || !p2m_is_ram(pt) ||
+                          (p2m_change_type(d, pfn, pt, p2m_ram_broken) != pt)) )
+                ret = -EINVAL;
+
+            put_gfn(d, pfn);
+            rcu_unlock_domain(d);
+        }
+        else
+            ret = -ESRCH;
+    }
+    break;
+
     default:
         ret = iommu_do_domctl(domctl, u_domctl);
         break;
diff -r 8b93ac0c93f3 -r d3378692eece xen/include/public/domctl.h
--- a/xen/include/public/domctl.h   Tue Nov 13 11:19:17 2012 +0000
+++ b/xen/include/public/domctl.h   Fri Nov 23 08:21:57 2012 +0800
@@ -136,6 +136,7 @@
 #define XEN_DOMCTL_PFINFO_LPINTAB (0x1U<<31)
 #define XEN_DOMCTL_PFINFO_XTAB    (0xfU<<28) /* invalid page */
 #define XEN_DOMCTL_PFINFO_XALLOC  (0xeU<<28) /* allocate-only page */
+#define XEN_DOMCTL_PFINFO_BROKEN  (0xdU<<28) /* broken page */
 #define XEN_DOMCTL_PFINFO_LTAB_MASK (0xfU<<28)

 struct xen_domctl_getpageframeinfo {
@@ -834,6 +835,12 @@
 typedef struct xen_domctl_set_access_required xen_domctl_set_access_required_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_access_required_t);

+struct xen_domctl_set_broken_page_p2m {
+    uint64_aligned_t pfn;
+};
+typedef struct xen_domctl_set_broken_page_p2m xen_domctl_set_broken_page_p2m_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_broken_page_p2m_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain       1
@@ -899,6 +906,7 @@
 #define XEN_DOMCTL_set_access_required 64
 #define XEN_DOMCTL_audit_p2m           65
 #define XEN_DOMCTL_set_virq_handler    66
+#define XEN_DOMCTL_set_broken_page_p2m 67
 #define XEN_DOMCTL_gdbsx_guestmemio  1000
 #define XEN_DOMCTL_gdbsx_pausevcpu   1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002
@@ -954,6 +962,7 @@
         struct xen_domctl_audit_p2m         audit_p2m;
         struct xen_domctl_set_virq_handler  set_virq_handler;
         struct xen_domctl_gdbsx_memio       gdbsx_guest_memio;
+        struct xen_domctl_set_broken_page_p2m set_broken_page_p2m;
         struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
         uint8_t                             pad[128];
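As an illustration of how a toolstack caller would use the new interface, a minimal sketch follows. It is hypothetical (not part of the patch) and assumes only the xc_set_broken_page_p2m() declaration added to xenctrl.h above, plus the standard libxc handle calls:

/* Hypothetical helper: mark one guest pfn of domain `domid' as broken
 * on the receiving host.  Error handling is deliberately minimal. */
#include <xenctrl.h>

int mark_page_broken(uint32_t domid, unsigned long pfn)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    int rc;

    if ( !xch )
        return -1;

    /* 0 on success, -1 on failure, per the xenctrl.h comment above. */
    rc = xc_set_broken_page_p2m(xch, domid, pfn);

    xc_interface_close(xch);
    return rc;
}

Note that the domctl behind this call fails with -EINVAL unless the pfn currently maps valid RAM, which is why the restore path above only issues it for pages whose transferred type is XEN_DOMCTL_PFINFO_BROKEN.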
George Dunlap
2012-Nov-23 16:26 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Technically, since you changed part of the code I acked, you should have removed this ack. But now I've read the patch:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Thanks Jinsong!
Liu, Jinsong
2012-Nov-25 13:24 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Thanks George!
Liu, Jinsong
2012-Nov-28 14:37 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Ping?

Thanks,
Jinsong
Ian Campbell
2012-Nov-29 10:02 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
> Ping?

Sorry, I've been meaning to reply but didn't manage to yet. Also you
replied to V4 saying to ignore it, so I was half waiting for V5, but I
see this should actually be labelled V5 anyway.

I'm afraid I still don't fully grok the reason for the loop that goes
with:

+            /*
+             * At the last iter, count the number of broken pages after sending,
+             * and if there are more than before sending, do one or more iter
+             * to make sure the pages are marked broken on the receiving side.
+             */

Can we go through it one more time? Sorry.

Let me outline the sequence of events and you can point out where I'm
going wrong. I'm afraid this has turned out to be rather long; again I'm
sorry for that.

First we do some number of iterations with the guest live. If an MCE
occurs during this phase then the page will be marked dirty and we will
pick this up on the next iteration and resend the page with the dirty
type etc., and all is fine. This all looks good to me, so we don't need
to worry about anything at this stage.

Eventually we get to the last iteration, at which point we pause the
guest. From here on in, the guest itself is not going to cause an MCE
(e.g. by touching its RAM) because it is not running, but we must still
account for the possibility of an asynchronous MCE of some sort, e.g.
triggered by the error detection mechanisms in the hardware, cosmic rays
and such like.

The final iteration proceeds roughly as follows.

     1. The domain is paused.
     2. We scan the dirty bitmap and add dirty pages to the batch of
        pages to process (there may be several batches in the last
        iteration; we only need to concern ourselves with any one batch
        here).
     3. We map all of the pages in the resulting batch with
        xc_map_foreign_bulk.
     4. We query the types of all the pages in the batch with
        xc_get_pfn_type_batch.
     5. We iterate over the batch, checking the type of each page; in
        some cases we do some incidental processing.
     6. We send the types of the pages in the batch over the wire.
     7. We iterate over the batch again, and send any normal page (not
        broken, xtab etc.) over the wire. Actually we do this as runs of
        normal pages, but the key point is we avoid touching any special
        page (including ones marked as broken by #4).

Is this sequence of events accurate?

Now let's consider the consequences of an MCE occurring at various
stages here.

Any MCE which happens before #4 is fine; we will pick that up in #4 and
the following steps will do the right thing.

Note that I am assuming that the mapping step in #3 is safe even for a
broken page, so long as we don't actually try and use the mapping (more
on that later). Is this true?

If an MCE occurs after #4 then the page will be marked as dirty in the
bitmap and Xen will internally mark it as broken, but we won't see
either of those with the current algorithm. There are two cases to think
about here AFAICT:

     A. The page was not already dirty at #2. In this case we know that
        the guest hasn't dirtied the page since the previous iteration
        and therefore the target has a good copy of this page from that
        time. The page isn't even in the batch we are processing. So we
        don't particularly care about the MCE here and can, from the PoV
        of migrating this guest, ignore it.
     B. The page was already dirty (but not broken; we handled that case
        above in "Any MCE which happens before #4...") at #2, which
        means we do not have an up-to-date copy on the target. This has
        two subcases:
             I. The MCE occurs before (or during) #6 (sending the page)
                and therefore we do not have a good up-to-date copy of
                that data at either end.
            II. The MCE occurs after #6, in which case we already have a
                good copy at the target end.

To fix B you have added an 8th step to the above:

     8. Query the types of the pages again, using xc_get_pfn_type_batch,
        and if there are more pages dirty now than we saw at #4
        (actually #5, when we scanned the array, but that distinction
        doesn't matter) then a new MCE must have occurred. Go back to #2
        and try again.

This won't do anything for A, since the page wasn't in the batch to
start with and so neither #4 nor #8 will look at its type; this is good
and proper.

So now we consider the two subcases of B. Let's consider B.II first
since it seems to be the more obvious case.

In case B.II the target end already has a good copy of the data page;
there is no need to mark the page as broken on the far end, nor to
arrange for a vMCE to be injected. I don't know if/how we arrange for
vMCEs to be injected under these circumstances; however, even if a vMCE
does get injected into the guest when it eventually gets unpaused on the
target, then all that will happen is that it will needlessly throw away
a good page. However, this is a rare corner case which is not worth
concerning ourselves with (it's largely indistinguishable from case A).
If the MCE had happened even a single cycle earlier then this would have
been a B.I event instead of a B.II one. In any case there is no need to
return to #2 and try again; everything will be just fine if we complete
the migration at this point.

In case B.I the MCE occurs before (or while) we send the page onto the
wire. We will therefore try to read from this page, because we haven't
looked at the type since #4 and have no idea that it is now broken.
Reading from the broken page will cause a fault, perhaps causing a vMCE
to be delivered to dom0, which causes the kernel to kill the process
doing the migration. Or maybe it kills dom0 or the host entirely. Either
way, the idea of looping again is rather moot.

Have I missed a case which needs thinking about?

I suspect B.I is the case where you are most likely to find a flaw in my
argument. Is there something else which is done in this case which would
allow us to continue?

Ian.
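For reference, the step-8 re-check corresponds to roughly the following control flow, simplified from the xc_domain_save.c hunk in the patch (batch bookkeeping and error reporting trimmed; broken_before_send is the count taken while scanning the batch at step 5):

/* Sketch: after sending a batch in what was believed to be the last
 * iteration, re-query page types and loop again if new pages broke. */
if ( last_iter )
{
    unsigned int j, broken_after_send = 0;

    /* Reload the pfn (or, for PV, mfn) of every page in the batch. */
    for ( j = 0; j < batch; j++ )
        pfn_type[j] = hvm ? pfn_batch[j] : pfn_to_mfn(pfn_batch[j]);

    /* Ask Xen for the current type of each page. */
    if ( xc_get_pfn_type_batch(xch, dom, batch, pfn_type) )
        goto out;                      /* query failed: abort the save */

    for ( j = 0; j < batch; j++ )
        if ( pfn_type[j] == XEN_DOMCTL_PFINFO_BROKEN )
            broken_after_send++;

    /* An MCE landed mid-batch: do another pass so the new BROKEN
     * type reaches the target. */
    if ( broken_before_send < broken_after_send )
        last_iter = 0;
}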
George Dunlap
2012-Nov-29 10:50 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
I think your analysis is correct -- the only question is whether B.I
will 100% cause the migration to crash, or whether there's a chance of
not crashing on the read. I had tried to ask that question before, and
understood Jinsong's response to be that it's not 100% sure that the
read would cause an error. However, looking back at the thread, I think
I may have understood something that was not there. :-)

So I guess the question for Jinsong is this: the only time this extra
loop could help is if there is a page broken after being paused but
before being sent the last time (B.I in Ian's analysis) -- in which
case, the migration code is 100% guaranteed to read a now-broken page.
What are the chances that this read of a broken page will *not* cause a
fault which will kill at least the migration process, if not dom0? If
the chances are "slim-to-none", then there is no point in the extra
check.

 -George
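To pin down where that read happens, here is a stripped-down view of one last-iteration batch. send_types(), send_page() and is_special() are illustrative stand-ins, not real libxc functions; the real code in xc_domain_save.c writes runs of normal pages rather than one page at a time:

/* Hypothetical sketch of steps 4-7, marking the B.I window. */
xc_get_pfn_type_batch(xch, dom, batch, pfn_type);  /* step 4: snapshot types */

send_types(io_fd, pfn_type, batch);                /* step 6 */

for ( j = 0; j < batch; j++ )
{
    /* Special as of the step-4 snapshot: BROKEN, XTAB, XALLOC. */
    if ( is_special(pfn_type[j]) )
        continue;

    /* B.I window: if an MCE broke this page after step 4, the snapshot
     * is stale and this read touches a poisoned page -- the SRAR-class
     * fault George is asking about. */
    send_page(io_fd, region_base + j * PAGE_SIZE); /* step 7 */
}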
Liu, Jinsong
2012-Nov-30 18:51 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Ian Campbell wrote:
> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>> Ping?
>
> [...]
>
> Can we go through it one more time? Sorry.

Sure, and I very much appreciate your kind and patient review. Your
comments help (and force) me to re-think it in more detail.

> First we do some number of iterations with the guest live. If an MCE
> occurs during this phase then the page will be marked dirty and we will
> pick this up on the next iteration and resend the page with the dirty
> type etc. and all is fine. This all looks good to me, so we don't need
> to worry about anything at this stage.

Hmm, it seems not so safe. If the page is good and it is in the dirty
bitmap (obtained at step #2), but during that iteration it breaks after
#4, the same issue occurs as in your analysis of the last iteration -->
the migration process will read a broken page (apart from that, I agree
this phase is OK) --> so let's merge the analysis of 'migration process
reads a broken page' into the last-iteration B.I case.

> The final iteration proceeds roughly as follows.
>
> [steps #1-#7 snipped]
>
> Is this sequence of events accurate?

Yes, exactly.

> Note that I am assuming that the mapping step in #3 is safe even for a
> broken page, so long as we don't actually try to use the mapping (more
> on that later) -- is this true?

I think so. The #3 mmu mapping is safe in itself.

> In case B.I the MCE occurs before (or while) we send the page onto the
> wire. We will therefore try to read from this page, because we haven't
> looked at the type since #4 and have no idea that it is now broken.

Right.

> Reading from the broken page will cause a fault, perhaps causing a vMCE
> to be delivered to dom0, which causes the kernel to kill the process
> doing the migration. Or maybe it kills dom0 or the host entirely.

Not exactly, I think. Reading the broken page will trigger a serious
SRAR error. In that case the hypervisor will inject a vMCE into the
guest which was migrating, not into dom0. The reason for this injection
is that the guest is the best one to handle it, with sufficient
clue/status/information (other components like the hypervisor or dom0
are not appropriate). As for the xl migration process, after returning
from the MCE context it *again* reads the broken page ... this will
kill the system entirely --> so we definitely need not care about
migration any more.

> Either way the idea of looping again is rather moot.

Agree with the conclusion, though I have a slightly different opinion
about its reason. So let's remove step #8?

Thanks,
Jinsong

> Have I missed a case which needs thinking about?
>
> I suspect B.I is the case where you are most likely to find a flaw in
> my argument. Is there something else which is done in this case which
> would allow us to continue?
>
> Ian.
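To make the flow Jinsong describes concrete, the fragment below restates
it as pseudocode. It is purely illustrative -- not actual Xen
MCE-handler code -- and inject_vmce() and the function name are
placeholders for this sketch:

    /*
     * Illustrative pseudocode only: the described handling of an SRAR
     * raised when dom0's migration process reads a guest-owned broken
     * page.
     */
    void srar_on_foreign_read(struct domain *page_owner)
    {
        /* The poisoned frame belongs to the migrating guest, so the
         * vMCE goes to that guest -- it has the status and context
         * needed to react -- not to dom0, whose vcpu merely performed
         * the read. */
        inject_vmce(page_owner);   /* placeholder name */

        /* Nothing prevents the dom0 process from resuming and reading
         * the same mapping again; that second consumption of poisoned
         * data is fatal to the host, which is why no save-side loop
         * can recover from case B.I. */
    }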
Liu, Jinsong
2012-Nov-30 18:57 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
George Dunlap wrote:
> On 29/11/12 10:02, Ian Campbell wrote:
>> [Ian's analysis snipped -- quoted in full above]
>
> I think your analysis is correct -- the only question is whether B.I
> will 100% cause the migration to crash, or whether there's a chance of
> not crashing on the read. I had tried to ask that question before, and
> understood Jinsong's response to be that it's not 100% sure that the
> read would cause an error. However, looking back at the thread, I
> think I may have understood something that was not there. :-)
>
> So I guess the question for Jinsong is this:
>
> The only time this extra loop could help is if there is a page broken
> after being paused but before being sent the last time (B.I in Ian's
> analysis) -- in which case, the migration code is 100% guaranteed to
> read a now-broken page. What are the chances that this read of a
> broken page will *not* cause a fault which will kill at least the
> migration process, if not dom0? If the chances are "slim-to-none",
> then there is no point in the extra check.
>
>  -George

Yes, the extra check is pointless: xl migration reading a broken page
will eventually kill the system. See my reply to IanC's email :-)

Thanks,
Jinsong
George Dunlap
2012-Dec-03 11:24 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
On 30/11/12 18:51, Liu, Jinsong wrote:
> Not exactly, I think. Reading the broken page will trigger a serious
> SRAR error. In that case the hypervisor will inject a vMCE into the
> guest which was migrating, not into dom0. The reason for this
> injection is that the guest is the best one to handle it, with
> sufficient clue/status/information (other components like the
> hypervisor or dom0 are not appropriate). As for the xl migration
> process, after returning from the MCE context it *again* reads the
> broken page ... this will kill the system entirely --> so we
> definitely need not care about migration any more.

Why would the vMCE be sent to the guest, rather than to the vcpu that
was running when the SRAR error occurred (i.e., the vcpu on which the
migration process was running)?

 -George
Liu, Jinsong
2012-Dec-05 15:57 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Liu, Jinsong wrote:
> Ian Campbell wrote:
>> [Ian's analysis snipped -- quoted in full earlier in the thread]
>
> [point-by-point replies snipped -- see my mail of 30 Nov above]
>> Either way the idea of looping again is rather moot.
>
> Agree with the conclusion, though I have a slightly different opinion
> about its reason. So let's remove step #8?
>
> Thanks,
> Jinsong

More simply, we can remove not only step #8 (the last-iteration check)
but also the action to 'mark broken page to dirty bitmap':
In any iteration there is a key point, #4. If a vMCE occurs before #4,
the proper pfn_type/pfn number will be transferred and xl migration
will not access the broken page (if the guest accesses the broken page
again it will be killed by the hypervisor). If a vMCE occurs after #4,
the system will crash and we need not care about migration any more.

So we can go back to the original patch, which handled 'vmce occurs
before migration', and entirely avoid adding specific code to handle
'vmce occurs during migration', since it in fact handles both cases
(and is simpler). Thoughts?

Thanks,
Jinsong
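A minimal sketch of the per-batch handling under this simplified
proposal, assuming everything hinges on the single type query at step
#4 (schematic only; send_page() and region_base are stand-ins for the
real save-loop machinery):

    /* Per-batch handling with neither dirty-bitmap marking of broken
     * pages nor a post-send recount (step #8 removed). */
    for ( j = 0; j < batch; j++ )
    {
        if ( pfn_type[j] == XEN_DOMCTL_PFINFO_BROKEN )
            /* Broke before #4: the type says so; send only the type so
             * the target can set p2m_ram_broken, never touch the data. */
            continue;

        /* Broke after #4: nothing below can know; this read consumes
         * poisoned memory and the host dies, so no extra code helps. */
        send_page(io_fd, region_base + j * PAGE_SIZE);
    }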
George Dunlap
2012-Dec-05 15:59 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
On 05/12/12 15:57, Liu, Jinsong wrote:
> More simply, we can remove not only step #8 (the last-iteration check)
> but also the action to 'mark broken page to dirty bitmap':
> In any iteration there is a key point, #4. If a vMCE occurs before #4,
> the proper pfn_type/pfn number will be transferred and xl migration
> will not access the broken page (if the guest accesses the broken page
> again it will be killed by the hypervisor). If a vMCE occurs after #4,
> the system will crash and we need not care about migration any more.
>
> So we can go back to the original patch, which handled 'vmce occurs
> before migration', and entirely avoid adding specific code to handle
> 'vmce occurs during migration', since it in fact handles both cases
> (and is simpler). Thoughts?

I thought part of the point of that was to have consistent behaviour
from the presented virtual hardware -- i.e., if the guest OS receives a
vMCE, and subsequently touches that page, it should get an SRAR? Is
that important or not?

 -George
Ian Campbell
2012-Dec-05 16:01 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
(please trim your quotes)

> More simply, we can remove not only step #8 (the last-iteration check)
> but also the action to 'mark broken page to dirty bitmap':
> [...]
> So we can go back to the original patch, which handled 'vmce occurs
> before migration', and entirely avoid adding specific code to handle
> 'vmce occurs during migration', since it in fact handles both cases
> (and is simpler). Thoughts?

Do you not need that stuff to handle MCEs during the live phase?

Specifically, an MCE occurring on a page which is not in the current
batch at all can be handled safely during the live phase.

Ian.
Liu, Jinsong
2012-Dec-05 16:15 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
Ian Campbell wrote:
> (please trim your quotes)
>
> [...]
>
> Do you not need that stuff to handle MCEs during the live phase?
>
> Specifically, an MCE occurring on a page which is not in the current
> batch at all can be handled safely during the live phase.
>
> Ian.

No, we don't need to care about it. Whether the page the MCE occurred
at is in the current batch or not is not important:
1. if xl will not transfer it in the future, that's fine;
2. if xl will transfer it in the future, but it was not in the current
   batch, that's also fine --> it will be handled in its own batch
   (the guest will be killed if it accesses the broken page again).

Thanks,
Jinsong
Liu, Jinsong
2012-Dec-05 16:36 UTC
Re: [PATCH V4] X86/vMCE: handle broken page with regard to migration
George Dunlap wrote:
> On 05/12/12 15:57, Liu, Jinsong wrote:
>> [...]
>
> I thought part of the point of that was to have consistent behaviour
> from the presented virtual hardware -- i.e., if the guest OS receives
> a vMCE, and subsequently touches that page, it should get an SRAR? Is
> that important or not?
>
>  -George

Currently the hypervisor will kill the guest in such a case (instead of
injecting an SRAR into the guest). The reason is that not all hardware
platforms support SRAR, so if the guest accesses the broken page the
hardware may trigger a more serious error: on some old platforms which
do not support SRAR it will crash the whole system. To prevent this bad
situation the hypervisor prefers to kill the guest.

(Under the MCA architecture there is no CPUID bit to indicate whether
the platform supports SRAR or not --> if MCi_STATUS reports a known
type indicating an SRAR, the hardware supports SRAR; otherwise the
hardware reports an unknown error --> the hypervisor then kills the
system in that case.)

So from the guest's point of view it will not seem strange: if it
receives a vMCE (SRAO) and then accesses the page again, it is killed,
as if it were running on a platform that does not support SRAR.

Thanks,
Jinsong
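For reference, the MCi_STATUS-based classification Jinsong refers to
follows from the architectural bit layout of IA32_MCi_STATUS (bit
positions per the Intel MCA definition; the helper itself is only an
illustration, not code from this patch):

    #include <stdint.h>

    #define MCi_STATUS_VAL  (1ULL << 63)  /* register contents valid */
    #define MCi_STATUS_UC   (1ULL << 61)  /* uncorrected error */
    #define MCi_STATUS_PCC  (1ULL << 57)  /* processor context corrupt */
    #define MCi_STATUS_S    (1ULL << 56)  /* signalled via #MC */
    #define MCi_STATUS_AR   (1ULL << 55)  /* action required */

    /* Classify an uncorrected error from its status bits.  Only
     * hardware that supports recoverable errors ever reports the
     * SRAR/SRAO encodings, which is why there is no CPUID feature bit
     * to check: support is inferred from what the hardware reports. */
    static inline const char *ucr_class(uint64_t status)
    {
        if ( !(status & MCi_STATUS_VAL) || !(status & MCi_STATUS_UC) )
            return "not an uncorrected error";
        if ( status & MCi_STATUS_PCC )
            return "fatal (processor context corrupt)";
        if ( (status & MCi_STATUS_S) && (status & MCi_STATUS_AR) )
            return "SRAR";  /* e.g. consuming poisoned data on a load */
        if ( status & MCi_STATUS_S )
            return "SRAO";  /* e.g. memory/patrol scrub */
        return "UCNA";      /* uncorrected, no action required */
    }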