Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 0/6] xen/arm: live migration support in arndale board
Note: resending the patch series since the last one was whitespace-corrupted.

Hello! Here goes the v5 patch series. The major changes in this version are the following:

1) Save and restore the vcpu-related registers in the hvm context (patch 1).
2) Rather than using a set_memory_map hypercall, use the GUEST_RAM_BASE macro as the start of the guest RAM physical address and just let the hypervisor know the maximum memory of the guest (patches 2 and 6).
3) Use a bitmap, rather than a linked list, for temporarily recording dirty pages (patch 5).

Patch 1 implements hvm save/restore (including vcpu).
Patch 2 implements the 'get_maximum_gpfn' hypercall.
Patch 3 implements 'modify_returncode' for switching the return value of the suspend hypercall from domU.
Patch 4 implements the base functions for VLPT.
Patch 5 implements dirty-page tracing using VLPT.
Patch 6 implements the toolstack part for live migration of ARM guests.

NOTE: In this version, I do not use the p2m walker infrastructure for p2m_change_entry_type_global. I think I can apply it after Stefano's patch is committed.

Best,
Jaeyong

Alexey Sokolov (2):
  xen/arm: Implement modify_returncode
  xen/arm: Implement toolstack for xl restore/save and migrate

Evgeny Fedotov (2):
  xen/arm: Implement hvm save and restore
  xen/arm: Implement get_maximum_gpfn hypercall for arm

Jaeyong Yoo (2):
  xen/arm: Implement virtual-linear page table for guest p2m mapping in
    live migration
  xen/arm: Implement hypercall for dirty page tracing

 config/arm32.mk                        |   1 +
 tools/libxc/Makefile                   |   6 +-
 tools/libxc/xc_arm_migrate.c           | 712 +++++++++++++++++++++++++++++++++
 tools/libxc/xc_dom_arm.c               |   4 +-
 tools/libxc/xc_resume.c                |  25 ++
 tools/misc/Makefile                    |   4 +-
 xen/arch/arm/Makefile                  |   1 +
 xen/arch/arm/domain.c                  |  19 +
 xen/arch/arm/domctl.c                  |  98 ++++-
 xen/arch/arm/hvm.c                     | 464 ++++++++++++++++++++-
 xen/arch/arm/mm.c                      | 237 ++++++++++-
 xen/arch/arm/p2m.c                     | 206 ++++++++++
 xen/arch/arm/save.c                    |  66 +++
 xen/arch/arm/traps.c                   |   9 +
 xen/common/Makefile                    |   2 +
 xen/include/asm-arm/arm32/page.h       |  41 +-
 xen/include/asm-arm/config.h           |   5 +
 xen/include/asm-arm/domain.h           |  14 +
 xen/include/asm-arm/hvm/support.h      |  29 ++
 xen/include/asm-arm/mm.h               |  24 ++
 xen/include/asm-arm/p2m.h              |   4 +
 xen/include/asm-arm/processor.h        |   2 +
 xen/include/public/arch-arm/hvm/save.h | 130 ++++++
 23 files changed, 2084 insertions(+), 19 deletions(-)
 create mode 100644 tools/libxc/xc_arm_migrate.c
 create mode 100644 xen/arch/arm/save.c
 create mode 100644 xen/include/asm-arm/hvm/support.h

-- 
1.8.1.2
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
From: Evgeny Fedotov <e.fedotov@samsung.com> Implement save/restore of hvm context hypercall. In hvm context save/restore, we save gic, timer and vfp registers. Changes from v4: Save vcpu registers within hvm context, and purge the save-vcpu-register patch. Singed-off-by: Evgeny Fedotov <e.fedotov@samsung.com> --- xen/arch/arm/Makefile | 1 + xen/arch/arm/domctl.c | 89 ++++++- xen/arch/arm/hvm.c | 464 ++++++++++++++++++++++++++++++++- xen/arch/arm/save.c | 66 +++++ xen/common/Makefile | 2 + xen/include/asm-arm/hvm/support.h | 29 +++ xen/include/public/arch-arm/hvm/save.h | 130 +++++++++ 7 files changed, 779 insertions(+), 2 deletions(-) create mode 100644 xen/arch/arm/save.c create mode 100644 xen/include/asm-arm/hvm/support.h diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile index 003ac84..8910a6c 100644 --- a/xen/arch/arm/Makefile +++ b/xen/arch/arm/Makefile @@ -31,6 +31,7 @@ obj-y += vuart.o obj-y += hvm.o obj-y += device.o obj-y += decode.o +obj-y += save.o #obj-bin-y += ....o diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index 851ee40..cb38e59 100644 --- a/xen/arch/arm/domctl.c +++ b/xen/arch/arm/domctl.c @@ -9,12 +9,99 @@ #include <xen/lib.h> #include <xen/errno.h> #include <xen/sched.h> +#include <xen/hvm/save.h> +#include <xen/guest_access.h> #include <public/domctl.h> long arch_do_domctl(struct xen_domctl *domctl, struct domain *d, XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { - return -ENOSYS; + long ret = 0; + bool_t copyback = 0; + + switch ( domctl->cmd ) + { + case XEN_DOMCTL_sethvmcontext: + { + struct hvm_domain_context c = { .size = domctl->u.hvmcontext.size }; + + ret = -ENOMEM; + if ( (c.data = xmalloc_bytes(c.size)) == NULL ) + goto sethvmcontext_out; + + ret = -EFAULT; + if ( copy_from_guest(c.data, domctl->u.hvmcontext.buffer, c.size) != 0 ) + goto sethvmcontext_out; + + domain_pause(d); + ret = hvm_load(d, &c); + domain_unpause(d); + + sethvmcontext_out: + if ( c.data != NULL ) + xfree(c.data); + } + break; + case XEN_DOMCTL_gethvmcontext: + { + struct hvm_domain_context c = { 0 }; + + ret = -EINVAL; + + c.size = hvm_save_size(d); + + if ( guest_handle_is_null(domctl->u.hvmcontext.buffer) ) + { + /* Client is querying for the correct buffer size */ + domctl->u.hvmcontext.size = c.size; + ret = 0; + goto gethvmcontext_out; + } + + /* Check that the client has a big enough buffer */ + ret = -ENOSPC; + if ( domctl->u.hvmcontext.size < c.size ) + { + printk("(gethvmcontext) size error: %d and %d\n", + domctl->u.hvmcontext.size, c.size ); + goto gethvmcontext_out; + } + + /* Allocate our own marshalling buffer */ + ret = -ENOMEM; + if ( (c.data = xmalloc_bytes(c.size)) == NULL ) + { + printk("(gethvmcontext) xmalloc_bytes failed: %d\n", c.size ); + goto gethvmcontext_out; + } + + domain_pause(d); + ret = hvm_save(d, &c); + domain_unpause(d); + + domctl->u.hvmcontext.size = c.cur; + if ( copy_to_guest(domctl->u.hvmcontext.buffer, c.data, c.size) != 0 ) + { + printk("(gethvmcontext) copy to guest failed\n"); + ret = -EFAULT; + } + + gethvmcontext_out: + copyback = 1; + + if ( c.data != NULL ) + xfree(c.data); + } + break; + + default: + return -EINVAL; + } + + if ( copyback && __copy_to_guest(u_domctl, domctl, 1) ) + ret = -EFAULT; + + return ret; } void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c) diff --git a/xen/arch/arm/hvm.c b/xen/arch/arm/hvm.c index 471c4cd..01ce2e7 100644 --- a/xen/arch/arm/hvm.c +++ b/xen/arch/arm/hvm.c @@ -7,14 +7,15 @@ #include <xsm/xsm.h> +#include <xen/hvm/save.h> #include <public/xen.h> #include 
<public/hvm/params.h> #include <public/hvm/hvm_op.h> #include <asm/hypercall.h> +#include <asm/gic.h> long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) - { long rc = 0; @@ -65,3 +66,464 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) return rc; } + +static int vgic_irq_rank_save(struct vgic_rank *ext, + struct vgic_irq_rank *rank) +{ + spin_lock(&rank->lock); + /* Some of VGIC registers are not used yet, it is for a future usage */ + /* IENABLE, IACTIVE, IPEND, PENDSGI registers */ + ext->ienable = rank->ienable; + ext->iactive = rank->iactive; + ext->ipend = rank->ipend; + ext->pendsgi = rank->pendsgi; + /* ICFG */ + ext->icfg[0] = rank->icfg[0]; + ext->icfg[1] = rank->icfg[1]; + /* IPRIORITY */ + if ( sizeof(rank->ipriority) != sizeof (ext->ipriority) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check ipriority dumping space\n"); + return -EINVAL; + } + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); + /* ITARGETS */ + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); + return -EINVAL; + } + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); + spin_unlock(&rank->lock); + return 0; +} + +static int vgic_irq_rank_restore(struct vgic_irq_rank *rank, + struct vgic_rank *ext) +{ + spin_lock(&rank->lock); + /* IENABLE, IACTIVE, IPEND, PENDSGI registers */ + rank->ienable = ext->ienable; + rank->iactive = ext->iactive; + rank->ipend = ext->ipend; + rank->pendsgi = ext->pendsgi; + /* ICFG */ + rank->icfg[0] = ext->icfg[0]; + rank->icfg[1] = ext->icfg[1]; + /* IPRIORITY */ + if ( sizeof(rank->ipriority) != sizeof (ext->ipriority) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check ipriority dumping space\n"); + return -EINVAL; + } + memcpy(rank->ipriority, ext->ipriority, sizeof(rank->ipriority)); + /* ITARGETS */ + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); + return -EINVAL; + } + memcpy(rank->itargets, ext->itargets, sizeof(rank->itargets)); + spin_unlock(&rank->lock); + return 0; +} + + +static int gic_save(struct domain *d, hvm_domain_context_t *h) +{ + struct hvm_hw_gic ctxt; + struct vcpu *v; + + /* Save the state of GICs */ + for_each_vcpu( d, v ) + { + ctxt.gic_hcr = v->arch.gic_hcr; + ctxt.gic_vmcr = v->arch.gic_vmcr; + ctxt.gic_apr = v->arch.gic_apr; + + /* Save list registers and masks */ + /* (it is not necessary to save/restore them, but LR state can have + * influence on downtime after Live Migration (to be tested) + */ + if ( sizeof(v->arch.gic_lr) > sizeof (ctxt.gic_lr) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: increase LR dumping space\n"); + return -EINVAL; + } + memcpy(ctxt.gic_lr, v->arch.gic_lr, sizeof(v->arch.gic_lr)); + ctxt.lr_mask = v->arch.lr_mask; + ctxt.event_mask = v->arch.event_mask; + + /* Save PPI states (per-CPU) */ + /* It is necessary if SMP enabled */ + if ( vgic_irq_rank_save(&ctxt.ppi_state, &v->arch.vgic.private_irqs) ) + return 1; + + if ( hvm_save_entry(GIC, v->vcpu_id, h, &ctxt) != 0 ) + return 1; + } + return 0; +} + +static int gic_load(struct domain *d, hvm_domain_context_t *h) +{ + int vcpuid; + struct hvm_hw_gic ctxt; + struct vcpu *v; + + /* Which vcpu is this? 
*/ + vcpuid = hvm_load_instance(h); + if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL ) + { + dprintk(XENLOG_G_ERR, "HVM restore: dom%u has no vcpu%u\n", + d->domain_id, vcpuid); + return -EINVAL; + } + + if ( hvm_load_entry(GIC, h, &ctxt) != 0 ) + return -EINVAL; + + v->arch.gic_hcr = ctxt.gic_hcr; + v->arch.gic_vmcr = ctxt.gic_vmcr; + v->arch.gic_apr = ctxt.gic_apr; + + /* Restore list registers and masks */ + if ( sizeof(v->arch.gic_lr) > sizeof (ctxt.gic_lr) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_gic: increase LR dumping space\n"); + return -EINVAL; + } + memcpy(v->arch.gic_lr, ctxt.gic_lr, sizeof(v->arch.gic_lr)); + v->arch.lr_mask = ctxt.lr_mask; + v->arch.event_mask = ctxt.event_mask; + + /* Restore PPI states */ + if ( vgic_irq_rank_restore(&v->arch.vgic.private_irqs, &ctxt.ppi_state) ) + return 1; + + return 0; +} + +HVM_REGISTER_SAVE_RESTORE(GIC, gic_save, gic_load, 1, HVMSR_PER_VCPU); + +static int timer_save(struct domain *d, hvm_domain_context_t *h) +{ + struct hvm_hw_timer ctxt; + struct vcpu *v; + struct vtimer *t; + int i; + + /* Save the state of vtimer and ptimer */ + for_each_vcpu( d, v ) + { + t = &v->arch.virt_timer; + for ( i = 0; i < 2; i++ ) + { + ctxt.cval = t->cval; + ctxt.ctl = t->ctl; + ctxt.vtb_offset = i ? d->arch.phys_timer_base.offset : + d->arch.virt_timer_base.offset; + ctxt.type = i ? TIMER_TYPE_PHYS : TIMER_TYPE_VIRT; + if ( hvm_save_entry(A15_TIMER, v->vcpu_id, h, &ctxt) != 0 ) + return 1; + t = &v->arch.phys_timer; + } + } + + return 0; +} + +static int timer_load(struct domain *d, hvm_domain_context_t *h) +{ + int vcpuid; + struct hvm_hw_timer ctxt; + struct vcpu *v; + struct vtimer *t = NULL; + + /* Which vcpu is this? */ + vcpuid = hvm_load_instance(h); + + if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL ) + { + dprintk(XENLOG_G_ERR, "HVM restore: dom%u has no vcpu%u\n", + d->domain_id, vcpuid); + return -EINVAL; + } + + if ( hvm_load_entry(A15_TIMER, h, &ctxt) != 0 ) + return -EINVAL; + + + if ( ctxt.type == TIMER_TYPE_VIRT ) + { + t = &v->arch.virt_timer; + d->arch.virt_timer_base.offset = ctxt.vtb_offset; + + } + else + { + t = &v->arch.phys_timer; + d->arch.phys_timer_base.offset = ctxt.vtb_offset; + } + + t->cval = ctxt.cval; + t->ctl = ctxt.ctl; + t->v = v; + + return 0; +} + +HVM_REGISTER_SAVE_RESTORE(A15_TIMER, timer_save, timer_load, 2, HVMSR_PER_VCPU); + +static int cpu_save(struct domain *d, hvm_domain_context_t *h) +{ + struct hvm_hw_cpu ctxt; + struct vcpu_guest_core_regs c; + struct vcpu *v; + + /* Save the state of CPU */ + for_each_vcpu( d, v ) + { + memset(&ctxt, 0, sizeof(ctxt)); + + ctxt.sctlr = v->arch.sctlr; + ctxt.ttbr0 = v->arch.ttbr0; + ctxt.ttbr1 = v->arch.ttbr1; + ctxt.ttbcr = v->arch.ttbcr; + + ctxt.dacr = v->arch.dacr; + ctxt.ifar = v->arch.ifar; + ctxt.ifsr = v->arch.ifsr; + ctxt.dfar = v->arch.dfar; + ctxt.dfsr = v->arch.dfsr; + +#ifdef CONFIG_ARM_32 + ctxt.mair0 = v->arch.mair0; + ctxt.mair1 = v->arch.mair1; +#else + ctxt.mair0 = v->arch.mair; +#endif + /* Control Registers */ + ctxt.actlr = v->arch.actlr; + ctxt.sctlr = v->arch.sctlr; + ctxt.cpacr = v->arch.cpacr; + + ctxt.contextidr = v->arch.contextidr; + ctxt.tpidr_el0 = v->arch.tpidr_el0; + ctxt.tpidr_el1 = v->arch.tpidr_el1; + ctxt.tpidrro_el0 = v->arch.tpidrro_el0; + + /* CP 15 */ + ctxt.csselr = v->arch.csselr; + + ctxt.afsr0 = v->arch.afsr0; + ctxt.afsr1 = v->arch.afsr1; + ctxt.vbar = v->arch.vbar; + ctxt.par = v->arch.par; + ctxt.teecr = v->arch.teecr; + ctxt.teehbr = v->arch.teehbr; + ctxt.joscr = v->arch.joscr; + ctxt.jmcr = 
v->arch.jmcr; + + memset(&c, 0, sizeof(c)); + + /* get guest core registers */ + vcpu_regs_hyp_to_user(v, &c); + + ctxt.x0 = c.x0; + ctxt.x1 = c.x1; + ctxt.x2 = c.x2; + ctxt.x3 = c.x3; + ctxt.x4 = c.x4; + ctxt.x5 = c.x5; + ctxt.x6 = c.x6; + ctxt.x7 = c.x7; + ctxt.x8 = c.x8; + ctxt.x9 = c.x9; + ctxt.x10 = c.x10; + ctxt.x11 = c.x11; + ctxt.x12 = c.x12; + ctxt.x13 = c.x13; + ctxt.x14 = c.x14; + ctxt.x15 = c.x15; + ctxt.x16 = c.x16; + ctxt.x17 = c.x17; + ctxt.x18 = c.x18; + ctxt.x19 = c.x19; + ctxt.x20 = c.x20; + ctxt.x21 = c.x21; + ctxt.x22 = c.x22; + ctxt.x23 = c.x23; + ctxt.x24 = c.x24; + ctxt.x25 = c.x25; + ctxt.x26 = c.x26; + ctxt.x27 = c.x27; + ctxt.x28 = c.x28; + ctxt.x29 = c.x29; + ctxt.x30 = c.x30; + ctxt.pc64 = c.pc64; + ctxt.cpsr = c.cpsr; + ctxt.spsr_el1 = c.spsr_el1; /* spsr_svc */ + + #ifdef CONFIG_ARM_32 + ctxt.spsr_fiq = c.spsr_fiq; + ctxt.spsr_irq = c.spsr_irq; + ctxt.spsr_und = c.spsr_und; + ctxt.spsr_abt = c.spsr_abt; + #endif + #ifdef CONFIG_ARM_64 + ctxt.sp_el0 = c.sp_el0; + ctxt.sp_el1 = c.sp_el1; + ctxt.elr_el1 = c.elr_el1; + #endif + + /* check VFP state size before dumping */ + if ( sizeof(v->arch.vfp) > sizeof (ctxt.vfp) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_cpu: increase VFP dumping space\n"); + return -EINVAL; + } + memcpy((void*) &ctxt.vfp, (void*) &v->arch.vfp, sizeof(v->arch.vfp)); + + ctxt.pause_flags = v->pause_flags; + + if ( hvm_save_entry(VCPU, v->vcpu_id, h, &ctxt) != 0 ) + return 1; + } + return 0; +} + +static int cpu_load(struct domain *d, hvm_domain_context_t *h) +{ + int vcpuid; + struct hvm_hw_cpu ctxt; + struct vcpu *v; + struct vcpu_guest_core_regs c; + + /* Which vcpu is this? */ + vcpuid = hvm_load_instance(h); + if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL ) + { + dprintk(XENLOG_G_ERR, "HVM restore: dom%u has no vcpu%u\n", + d->domain_id, vcpuid); + return -EINVAL; + } + + if ( hvm_load_entry(VCPU, h, &ctxt) != 0 ) + return -EINVAL; + + v->arch.sctlr = ctxt.sctlr; + v->arch.ttbr0 = ctxt.ttbr0; + v->arch.ttbr1 = ctxt.ttbr1; + v->arch.ttbcr = ctxt.ttbcr; + + v->arch.dacr = ctxt.dacr; + v->arch.ifar = ctxt.ifar; + v->arch.ifsr = ctxt.ifsr; + v->arch.dfar = ctxt.dfar; + v->arch.dfsr = ctxt.dfsr; + +#ifdef CONFIG_ARM_32 + v->arch.mair0 = ctxt.mair0; + v->arch.mair1 = ctxt.mair1; +#else + v->arch.mair = ctxt.mair0; +#endif + + /* Control Registers */ + v->arch.actlr = ctxt.actlr; + v->arch.cpacr = ctxt.cpacr; + v->arch.contextidr = ctxt.contextidr; + v->arch.tpidr_el0 = ctxt.tpidr_el0; + v->arch.tpidr_el1 = ctxt.tpidr_el1; + v->arch.tpidrro_el0 = ctxt.tpidrro_el0; + + /* CP 15 */ + v->arch.csselr = ctxt.csselr; + + v->arch.afsr0 = ctxt.afsr0; + v->arch.afsr1 = ctxt.afsr1; + v->arch.vbar = ctxt.vbar; + v->arch.par = ctxt.par; + v->arch.teecr = ctxt.teecr; + v->arch.teehbr = ctxt.teehbr; + v->arch.joscr = ctxt.joscr; + v->arch.jmcr = ctxt.jmcr; + + /* fill guest core registers */ + memset(&c, 0, sizeof(c)); + c.x0 = ctxt.x0; + c.x1 = ctxt.x1; + c.x2 = ctxt.x2; + c.x3 = ctxt.x3; + c.x4 = ctxt.x4; + c.x5 = ctxt.x5; + c.x6 = ctxt.x6; + c.x7 = ctxt.x7; + c.x8 = ctxt.x8; + c.x9 = ctxt.x9; + c.x10 = ctxt.x10; + c.x11 = ctxt.x11; + c.x12 = ctxt.x12; + c.x13 = ctxt.x13; + c.x14 = ctxt.x14; + c.x15 = ctxt.x15; + c.x16 = ctxt.x16; + c.x17 = ctxt.x17; + c.x18 = ctxt.x18; + c.x19 = ctxt.x19; + c.x20 = ctxt.x20; + c.x21 = ctxt.x21; + c.x22 = ctxt.x22; + c.x23 = ctxt.x23; + c.x24 = ctxt.x24; + c.x25 = ctxt.x25; + c.x26 = ctxt.x26; + c.x27 = ctxt.x27; + c.x28 = ctxt.x28; + c.x29 = ctxt.x29; + c.x30 = ctxt.x30; + c.pc64 = ctxt.pc64; + c.cpsr = ctxt.cpsr; 
+ c.spsr_el1 = ctxt.spsr_el1; /* spsr_svc */ + + #ifdef CONFIG_ARM_32 + c.spsr_fiq = ctxt.spsr_fiq; + c.spsr_irq = ctxt.spsr_irq; + c.spsr_und = ctxt.spsr_und; + c.spsr_abt = ctxt.spsr_abt; + #endif + #ifdef CONFIG_ARM_64 + c.sp_el0 = ctxt.sp_el0; + c.sp_el1 = ctxt.sp_el1; + c.elr_el1 = ctxt.elr_el1; + #endif + + /* set guest core registers */ + vcpu_regs_user_to_hyp(v, &c); + + if ( sizeof(v->arch.vfp) > sizeof (ctxt.vfp) ) + { + dprintk(XENLOG_G_ERR, "hvm_hw_cpu: increase VFP dumping space\n"); + return -EINVAL; + } + + memcpy(&v->arch.vfp, &ctxt, sizeof(v->arch.vfp)); + + v->is_initialised = 1; + v->pause_flags = ctxt.pause_flags; + + return 0; +} + +HVM_REGISTER_SAVE_RESTORE(VCPU, cpu_save, cpu_load, 1, HVMSR_PER_VCPU); + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/arch/arm/save.c b/xen/arch/arm/save.c new file mode 100644 index 0000000..c923910 --- /dev/null +++ b/xen/arch/arm/save.c @@ -0,0 +1,66 @@ +/* + * hvm/save.c: Save and restore HVM guest''s emulated hardware state for ARM. + * + * Copyright (c) 2013, Samsung Electronics. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + */ + +#include <asm/hvm/support.h> +#include <public/hvm/save.h> + +void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr) +{ + hdr->cpuid = READ_SYSREG32(MIDR_EL1); +} + +int arch_hvm_load(struct domain *d, struct hvm_save_header *hdr) +{ + uint32_t cpuid; + + if ( hdr->magic != HVM_FILE_MAGIC ) + { + printk(XENLOG_G_ERR "HVM%d restore: bad magic number %#"PRIx32"\n", + d->domain_id, hdr->magic); + return -1; + } + + if ( hdr->version != HVM_FILE_VERSION ) + { + printk(XENLOG_G_ERR "HVM%d restore: unsupported version %u\n", + d->domain_id, hdr->version); + return -1; + } + + cpuid = READ_SYSREG32(MIDR_EL1); + if ( hdr->cpuid != cpuid ) + { + printk(XENLOG_G_INFO "HVM%d restore: VM saved on one CPU " + "(%#"PRIx32") and restored on another (%#"PRIx32").\n", + d->domain_id, hdr->cpuid, cpuid); + return -1; + } + + return 0; +} + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/common/Makefile b/xen/common/Makefile index 686f7a1..f943302 100644 --- a/xen/common/Makefile +++ b/xen/common/Makefile @@ -63,6 +63,8 @@ subdir-$(CONFIG_COMPAT) += compat subdir-$(x86_64) += hvm +subdir-$(CONFIG_ARM) += hvm + subdir-$(coverage) += gcov subdir-y += libelf diff --git a/xen/include/asm-arm/hvm/support.h b/xen/include/asm-arm/hvm/support.h new file mode 100644 index 0000000..8311f2f --- /dev/null +++ b/xen/include/asm-arm/hvm/support.h @@ -0,0 +1,29 @@ +/* + * support.h: HVM support routines used by ARMv7 VE. 
+ * + * Copyright (c) 2012, Citrix Systems + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + */ + +#ifndef __ASM_ARM_HVM_SUPPORT_H__ +#define __ASM_ARM_HVM_SUPPORT_H__ + +#include <xen/types.h> +#include <public/hvm/ioreq.h> +#include <xen/sched.h> +#include <xen/hvm/save.h> +#include <asm/processor.h> + +#endif /* __ASM_ARM_HVM_SUPPORT_H__ */ diff --git a/xen/include/public/arch-arm/hvm/save.h b/xen/include/public/arch-arm/hvm/save.h index 75b8e65..1f71c41 100644 --- a/xen/include/public/arch-arm/hvm/save.h +++ b/xen/include/public/arch-arm/hvm/save.h @@ -26,6 +26,136 @@ #ifndef __XEN_PUBLIC_HVM_SAVE_ARM_H__ #define __XEN_PUBLIC_HVM_SAVE_ARM_H__ +#define HVM_FILE_MAGIC 0x92385520 +#define HVM_FILE_VERSION 0x00000001 + + +struct hvm_save_header +{ + uint32_t magic; /* Must be HVM_FILE_MAGIC */ + uint32_t version; /* File format version */ + uint64_t changeset; /* Version of Xen that saved this file */ + uint32_t cpuid; /* MIDR_EL1 on the saving machine */ +}; + +DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header); + +struct vgic_rank +{ + uint32_t ienable, iactive, ipend, pendsgi; + uint32_t icfg[2]; + uint32_t ipriority[8]; + uint32_t itargets[8]; +}; + +struct hvm_hw_gic +{ + uint32_t gic_hcr; + uint32_t gic_vmcr; + uint32_t gic_apr; + uint32_t gic_lr[64]; + uint64_t event_mask; + uint64_t lr_mask; + struct vgic_rank ppi_state; +}; + +DECLARE_HVM_SAVE_TYPE(GIC, 2, struct hvm_hw_gic); + +#define TIMER_TYPE_VIRT 0 +#define TIMER_TYPE_PHYS 1 + +struct hvm_hw_timer +{ + uint64_t vtb_offset; + uint32_t ctl; + uint64_t cval; + uint32_t type; +}; + +DECLARE_HVM_SAVE_TYPE(A15_TIMER, 3, struct hvm_hw_timer); + + +struct hvm_hw_cpu +{ + uint64_t vfp[34]; /* Vector floating pointer */ + /* VFP v3 state is 34x64 bit, VFP v4 is not yet supported */ + + /* Guest core registers */ + uint64_t x0; /* r0_usr */ + uint64_t x1; /* r1_usr */ + uint64_t x2; /* r2_usr */ + uint64_t x3; /* r3_usr */ + uint64_t x4; /* r4_usr */ + uint64_t x5; /* r5_usr */ + uint64_t x6; /* r6_usr */ + uint64_t x7; /* r7_usr */ + uint64_t x8; /* r8_usr */ + uint64_t x9; /* r9_usr */ + uint64_t x10; /* r10_usr */ + uint64_t x11; /* r11_usr */ + uint64_t x12; /* r12_usr */ + uint64_t x13; /* sp_usr */ + uint64_t x14; /* lr_usr; */ + uint64_t x15; /* __unused_sp_hyp */ + uint64_t x16; /* lr_irq */ + uint64_t x17; /* sp_irq */ + uint64_t x18; /* lr_svc */ + uint64_t x19; /* sp_svc */ + uint64_t x20; /* lr_abt */ + uint64_t x21; /* sp_abt */ + uint64_t x22; /* lr_und */ + uint64_t x23; /* sp_und */ + uint64_t x24; /* r8_fiq */ + uint64_t x25; /* r9_fiq */ + uint64_t x26; /* r10_fiq */ + uint64_t x27; /* r11_fiq */ + uint64_t x28; /* r12_fiq */ + uint64_t x29; /* fp,sp_fiq */ + uint64_t x30; /* lr_fiq */ + uint64_t pc64; /* ELR_EL2 */ + uint32_t cpsr; /* SPSR_EL2 */ + uint32_t spsr_el1; /*spsr_svc */ + /* AArch32 guests only */ + uint32_t spsr_fiq, spsr_irq, spsr_und, spsr_abt; + /* AArch64 guests only */ + uint64_t 
sp_el0; + uint64_t sp_el1, elr_el1; + + uint32_t sctlr, ttbcr; + uint64_t ttbr0, ttbr1; + + uint32_t ifar, dfar; + uint32_t ifsr, dfsr; + uint32_t dacr; + uint64_t par; + + uint64_t mair0, mair1; + uint64_t tpidr_el0; + uint64_t tpidr_el1; + uint64_t tpidrro_el0; + uint64_t vbar; + + /* Control Registers */ + uint32_t actlr; + uint32_t cpacr; + uint32_t afsr0, afsr1; + uint32_t contextidr; + uint32_t teecr, teehbr; /* ThumbEE, 32-bit guests only */ + uint32_t joscr, jmcr; + /* CP 15 */ + uint32_t csselr; + + unsigned long pause_flags; + +}; + +DECLARE_HVM_SAVE_TYPE(VCPU, 4, struct hvm_hw_cpu); + +/* + * Largest type-code in use + */ +#define HVM_SAVE_CODE_MAX 4 + #endif /* -- 1.8.1.2
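The gethvmcontext domctl added above follows the usual two-call protocol: a first call with a null buffer only reports the required size, and a second call fills the buffer marshalled by hvm_save(). The sketch below shows how a toolstack consumer could drive it; it assumes the existing libxc wrapper xc_domain_hvm_getcontext() is reused unchanged on ARM (this series does not touch that wrapper, so treat the sketch as illustrative only).

/* Minimal sketch: fetch a domain's HVM context from the toolstack,
 * assuming xc_domain_hvm_getcontext() is reused as on x86. */
#include <stdio.h>
#include <stdlib.h>
#include <xenctrl.h>

static int dump_hvm_context(xc_interface *xch, uint32_t domid)
{
    uint8_t *buf;
    int size, written;

    /* First call with a NULL buffer: the domctl only reports the size. */
    size = xc_domain_hvm_getcontext(xch, domid, NULL, 0);
    if ( size <= 0 )
        return -1;

    buf = calloc(1, size);
    if ( buf == NULL )
        return -1;

    /* Second call fills the marshalling buffer built by hvm_save(). */
    written = xc_domain_hvm_getcontext(xch, domid, buf, size);
    if ( written <= 0 )
    {
        free(buf);
        return -1;
    }

    printf("dom%u HVM context: %d bytes\n", domid, written);
    free(buf);
    return 0;
}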
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
From: Evgeny Fedotov <e.fedotov@samsung.com>

Implement the get_maximum_gpfn hypercall for ARM: the guest RAM region
starts at GUEST_RAM_BASE and its size is the domain's maximum memory, so
the maximum GPFN follows directly from those two values.

Changes from v4: Use GUEST_RAM_BASE as the start physical address of guest
RAM, and purge the set-memory-map patch.

Signed-off-by: Evgeny Fedotov <e.fedotov@samsung.com>
---
 xen/arch/arm/mm.c        | 22 +++++++++++++++++++++-
 xen/include/asm-arm/mm.h |  2 ++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 123280e..3801f07 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
 
 unsigned long domain_get_maximum_gpfn(struct domain *d)
 {
-    return -ENOSYS;
+    paddr_t end;
+
+    get_gma_start_end(d, NULL, &end);
+
+    return (unsigned long) (end >> PAGE_SHIFT);
 }
 
 void share_xen_page_with_guest(struct page_info *page,
@@ -1308,6 +1312,22 @@ int is_iomem_page(unsigned long mfn)
         return 1;
     return 0;
 }
+
+/* Guest RAM base */
+#define GUEST_RAM_BASE 0x80000000
+/*
+ * XXX: Use correct definition for RAM base when the following patch
+ * xen: arm: 64-bit guest support and domU FDT autogeneration
+ * will be upstreamed.
+ */
+void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end)
+{
+    if ( start )
+        *start = GUEST_RAM_BASE;
+    if ( end )
+        *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << PAGE_SHIFT);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index ce66099..c5cb3af 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -341,6 +341,8 @@ static inline void put_page_and_type(struct page_info *page)
     put_page(page);
 }
 
+void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end);
+
 #endif /* __ARCH_ARM_MM__ */
 /*
  * Local variables:
-- 
1.8.1.2
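For reference, the arithmetic behind get_gma_start_end() is a single shift and add. The standalone sketch below works through it in userspace; the 512MB domain size is an illustrative example, not a value from the patch.

/* Illustrative only: the guest RAM window implied by this patch. */
#include <stdint.h>
#include <stdio.h>

#define GUEST_RAM_BASE 0x80000000ULL
#define PAGE_SHIFT     12

int main(void)
{
    uint64_t max_pages = 131072;              /* example: a 512MB guest */
    uint64_t start = GUEST_RAM_BASE;
    uint64_t end   = GUEST_RAM_BASE + (max_pages << PAGE_SHIFT);

    /* domain_get_maximum_gpfn() returns end >> PAGE_SHIFT */
    printf("gma: [0x%llx, 0x%llx), max gpfn 0x%llx\n",
           (unsigned long long)start, (unsigned long long)end,
           (unsigned long long)(end >> PAGE_SHIFT));
    return 0;
}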
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 3/6] xen/arm: Implement modify_returncode
From: Alexey Sokolov <sokolov.a@samsung.com>

Make the sched_op call in do_suspend (drivers/xen/manage.c) return 0 on
the success of suspend.

Signed-off-by: Alexey Sokolov <sokolov.a@samsung.com>
---
 tools/libxc/xc_resume.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/tools/libxc/xc_resume.c b/tools/libxc/xc_resume.c
index 18b4818..9315eb8 100644
--- a/tools/libxc/xc_resume.c
+++ b/tools/libxc/xc_resume.c
@@ -73,6 +73,31 @@ static int modify_returncode(xc_interface *xch, uint32_t domid)
     return 0;
 }
 
+#elif defined(__arm__)
+
+static int modify_returncode(xc_interface *xch, uint32_t domid)
+{
+    vcpu_guest_context_any_t ctxt;
+    xc_dominfo_t info;
+    int rc;
+
+    if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 )
+    {
+        PERROR("Could not get domain info");
+        return -1;
+    }
+
+    if ( (rc = xc_vcpu_getcontext(xch, domid, 0, &ctxt)) != 0 )
+        return rc;
+
+    ctxt.c.user_regs.r0_usr = 1;
+
+    if ( (rc = xc_vcpu_setcontext(xch, domid, 0, &ctxt)) != 0 )
+        return rc;
+
+    return 0;
+}
+
 #else
 
 static int modify_returncode(xc_interface *xch, uint32_t domid)
-- 
1.8.1.2
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
Implement vlpt (virtual-linear page table) for fast accessing of 3rd PTE of guest p2m. When creating a mapping for vlpt, just copy the 1st level PTE of guest p2m to the xen''s 2nd level PTE. Then the mapping becomes the following: xen''s 1st PTE --> xen''s 2nd PTE (which is the same as 1st PTE of guest p2m) --> guest p2m''s 2nd PTE --> guest p2m''s 3rd PTE (the memory contents where the vlpt points) For more info about vlpt, see: http://www.technovelty.org/linux/virtual-linear-page-table.html This function is used in dirty-page tracing: when domU write-fault is trapped by xen, xen can immediately locate the 3rd PTE of guest p2m. The following link shows the performance comparison for handling a dirty-page between vlpt and typical page table walking. http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html Changes from v4: 1. In the restoring vlpt, use __foo variant without barriers for write_pte and flush_xen_data_tlb_range_va. 2. Support two consecutive pages for guest''s first level page table. Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> --- xen/arch/arm/domain.c | 5 ++ xen/arch/arm/mm.c | 112 +++++++++++++++++++++++++++++++++++++++ xen/include/asm-arm/arm32/page.h | 41 ++++++++++---- xen/include/asm-arm/config.h | 5 ++ xen/include/asm-arm/domain.h | 7 +++ xen/include/asm-arm/mm.h | 15 ++++++ 6 files changed, 174 insertions(+), 11 deletions(-) diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index cb0424d..c0b5dd8 100644 --- a/xen/arch/arm/domain.c +++ b/xen/arch/arm/domain.c @@ -509,6 +509,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) /* Default the virtual ID to match the physical */ d->arch.vpidr = boot_cpu_data.midr.bits; + d->arch.dirty.second_lvl_start = 0; + d->arch.dirty.second_lvl_end = 0; + d->arch.dirty.second_lvl[0] = NULL; + d->arch.dirty.second_lvl[1] = NULL; + clear_page(d->shared_info); share_xen_page_with_guest( virt_to_page(d->shared_info), d, XENSHARE_writable); diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index 3801f07..bf13993 100644 --- a/xen/arch/arm/mm.c +++ b/xen/arch/arm/mm.c @@ -1328,6 +1328,118 @@ void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end) *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << PAGE_SHIFT); } +/* flush the vlpt area */ +void flush_vlpt(struct domain *d) +{ + int flush_size; + flush_size = (d->arch.dirty.second_lvl_end - + d->arch.dirty.second_lvl_start) << SECOND_SHIFT; + /* flushing the 3rd level mapping */ + flush_xen_data_tlb_range_va(d->arch.dirty.second_lvl_start << SECOND_SHIFT, + flush_size); +} + +/* restore the xen page table for vlpt mapping for domain d */ +void restore_vlpt(struct domain *d) +{ + int i; + dsb(); + for ( i = d->arch.dirty.second_lvl_start; + i < d->arch.dirty.second_lvl_end; + ++i ) + { + int k = i % LPAE_ENTRIES; + int l = i / LPAE_ENTRIES; + + if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits ) + { + __write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]); + __flush_xen_data_tlb_range_va(i << SECOND_SHIFT, + 1 << SECOND_SHIFT); + } + } + dsb(); + isb(); +} + +/* setting up the xen page table for vlpt mapping for domain d */ +int prepare_vlpt(struct domain *d) +{ + int xen_second_linear_base; + int gp2m_start_index, gp2m_end_index; + struct p2m_domain *p2m = &d->arch.p2m; + struct page_info *second_lvl_page; + paddr_t gma_start = 0; + paddr_t gma_end = 0; + lpae_t *first[2]; + int i; + uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START; + + get_gma_start_end(d, &gma_start, &gma_end); + required = 
(gma_end - gma_start) >> LPAE_SHIFT; + + + if ( required > avail ) + { + dprintk(XENLOG_ERR, "Available VLPT is small for domU guest" + "(avail: %llx, required: %llx)\n", + avail, required); + return -ENOMEM; + } + + xen_second_linear_base = second_linear_offset(VIRT_LIN_P2M_START); + + gp2m_start_index = gma_start >> FIRST_SHIFT; + gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1; + + if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 ) + { + dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU"); + return -ENOMEM; + } + + second_lvl_page = alloc_domheap_pages(NULL, 1, 0); + if ( second_lvl_page == NULL ) + return -ENOMEM; + + /* First level p2m is 2 consecutive pages */ + d->arch.dirty.second_lvl[0] = map_domain_page_global( + page_to_mfn(second_lvl_page) ); + d->arch.dirty.second_lvl[1] = map_domain_page_global( + page_to_mfn(second_lvl_page+1) ); + + first[0] = __map_domain_page(p2m->first_level); + first[1] = __map_domain_page(p2m->first_level+1); + for ( i = gp2m_start_index; i < gp2m_end_index; ++i ) + { + int k = i % LPAE_ENTRIES; + int l = i / LPAE_ENTRIES; + int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES; + int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES; + + write_pte(&xen_second[xen_second_linear_base+i], first[l][k]); + + /* we copy the mapping into domain''s structure as a reference + * in case of the context switch (used in restore_vlpt) */ + d->arch.dirty.second_lvl[l2][k2] = first[l][k]; + } + unmap_domain_page(first[0]); + unmap_domain_page(first[1]); + + /* storing the start and end index */ + d->arch.dirty.second_lvl_start = xen_second_linear_base + gp2m_start_index; + d->arch.dirty.second_lvl_end = xen_second_linear_base + gp2m_end_index; + + flush_vlpt(d); + return 0; +} + +void cleanup_vlpt(struct domain *d) +{ + /* First level p2m is 2 consecutive pages */ + unmap_domain_page_global(d->arch.dirty.second_lvl[0]); + unmap_domain_page_global(d->arch.dirty.second_lvl[1]); +} /* * Local variables: * mode: C diff --git a/xen/include/asm-arm/arm32/page.h b/xen/include/asm-arm/arm32/page.h index cf12a89..0a4e115 100644 --- a/xen/include/asm-arm/arm32/page.h +++ b/xen/include/asm-arm/arm32/page.h @@ -5,20 +5,28 @@ /* Write a pagetable entry. * - * If the table entry is changing a text mapping, it is responsibility - * of the caller to issue an ISB after write_pte. + * All necessary barriers are responsibility of the caller */ -static inline void write_pte(lpae_t *p, lpae_t pte) +static inline void __write_pte(lpae_t *p, lpae_t pte) { asm volatile ( - /* Ensure any writes have completed with the old mappings. */ - "dsb;" /* Safely write the entry (STRD is atomic on CPUs that support LPAE) */ "strd %0, %H0, [%1];" - "dsb;" : : "r" (pte.bits), "r" (p) : "memory"); } +/* Write a pagetable entry with dsb barriers. + * + * If the table entry is changing a text mapping, it is responsibility + * of the caller to issue an ISB after write_pte. + */ +static inline void write_pte(lpae_t *p, lpae_t pte) +{ + dsb(); + __write_pte(p, pte); + dsb(); +} + /* Inline ASM to flush dcache on register R (may be an inline asm operand) */ #define __flush_xen_dcache_one(R) STORE_CP32(R, DCCMVAC) @@ -57,18 +65,28 @@ static inline void flush_xen_data_tlb(void) } /* - * Flush a range of VA''s hypervisor mappings from the data TLB. This is not - * sufficient when changing code mappings or for self modifying code. + * Flush a range of VA''s hypervisor mappings from the data TLB. 
+ * All necessary barriers are responsibility of the caller */ -static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) +static inline void __flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) { unsigned long end = va + size; - dsb(); /* Ensure preceding are visible */ while ( va < end ) { asm volatile(STORE_CP32(0, TLBIMVAH) : : "r" (va) : "memory"); va += PAGE_SIZE; } +} + +/* + * Flush a range of VA''s hypervisor mappings from the data TLB with barriers. + * This barrier is not sufficient when changing code mappings or for self + * modifying code. + */ +static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) +{ + dsb(); /* Ensure preceding are visible */ + __flush_xen_data_tlb_range_va(va, size); dsb(); /* Ensure completion of the TLB flush */ isb(); } diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h index 5b7b1a8..15ad56d 100644 --- a/xen/include/asm-arm/config.h +++ b/xen/include/asm-arm/config.h @@ -87,6 +87,7 @@ * 0 - 8M <COMMON> * * 32M - 128M Frametable: 24 bytes per page for 16GB of RAM + * 128M - 256M Virtual-linear mapping to P2M table * 256M - 1G VMAP: ioremap and early_ioremap use this virtual address * space * @@ -124,7 +125,9 @@ #define CONFIG_SEPARATE_XENHEAP 1 #define FRAMETABLE_VIRT_START _AT(vaddr_t,0x02000000) +#define VIRT_LIN_P2M_START _AT(vaddr_t,0x08000000) #define VMAP_VIRT_START _AT(vaddr_t,0x10000000) +#define VIRT_LIN_P2M_END VMAP_VIRT_START #define XENHEAP_VIRT_START _AT(vaddr_t,0x40000000) #define XENHEAP_VIRT_END _AT(vaddr_t,0x7fffffff) #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) @@ -157,6 +160,8 @@ #define HYPERVISOR_VIRT_END DIRECTMAP_VIRT_END +/*TODO (ARM_64): define VIRT_LIN_P2M_START VIRT_LIN_P2M_END */ + #endif /* Fixmap slots */ diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h index 67bfbbc..4f366f1 100644 --- a/xen/include/asm-arm/domain.h +++ b/xen/include/asm-arm/domain.h @@ -112,6 +112,13 @@ struct arch_domain spinlock_t lock; } vuart; + /* dirty-page tracing */ + struct { + volatile int second_lvl_start; /* for context switch */ + volatile int second_lvl_end; + lpae_t *second_lvl[2]; /* copy of guest p2m''s first */ + } dirty; + } __cacheline_aligned; struct arch_vcpu diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h index c5cb3af..a74e135 100644 --- a/xen/include/asm-arm/mm.h +++ b/xen/include/asm-arm/mm.h @@ -342,6 +342,21 @@ static inline void put_page_and_type(struct page_info *page) } void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end); +int prepare_vlpt(struct domain *d); +void cleanup_vlpt(struct domain *d); +void restore_vlpt(struct domain *d); + +/* calculate the xen''s virtual address for accessing the leaf PTE of + * a given address (GPA) */ +static inline lpae_t * get_vlpt_3lvl_pte(paddr_t addr) +{ + lpae_t *table = (lpae_t *)VIRT_LIN_P2M_START; + + /* Since we slotted the guest''s first p2m page table to xen''s + * second page table, one shift is enough for calculating the + * index of guest p2m table entry */ + return &table[addr >> PAGE_SHIFT]; +} #endif /* __ARCH_ARM_MM__ */ /* -- 1.8.1.2
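The payoff of copying the guest's first-level p2m entries into xen's second-level slots is that a leaf PTE can then be located with a single shift, exactly as get_vlpt_3lvl_pte() does. The standalone sketch below mirrors that arithmetic; the guest physical address used is an arbitrary example.

/* Illustrative sketch of the virtual-linear lookup set up by this patch. */
#include <stdint.h>
#include <stdio.h>

#define VIRT_LIN_P2M_START 0x08000000UL
#define PAGE_SHIFT         12
#define LPAE_ENTRY_SIZE    8              /* sizeof(lpae_t) */

int main(void)
{
    uint64_t gpa = 0x80123456ULL;          /* some guest physical address */

    /* get_vlpt_3lvl_pte(): &((lpae_t *)VIRT_LIN_P2M_START)[gpa >> PAGE_SHIFT] */
    uint64_t pte_va = VIRT_LIN_P2M_START +
                      (gpa >> PAGE_SHIFT) * LPAE_ENTRY_SIZE;

    printf("leaf PTE of gpa 0x%llx is mapped at VA 0x%llx\n",
           (unsigned long long)gpa, (unsigned long long)pte_va);
    return 0;
}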
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
Add hypercall (shadow op: enable/disable and clean/peek dirtied page bitmap). It consists of two parts: dirty page detecting and saving. For detecting, we setup the guest p2m''s leaf PTE read-only and whenever the guest tries to write something, permission fault happens and traps into xen. The permission-faulted GPA should be saved for the toolstack (when it wants to see which pages are dirtied). For this purpose, we temporarily save the GPAs into bitmap. Changes from v4: 1. For temporary saving dirty pages, use bitmap rather than linked list. 2. Setup the p2m''s second level page as read-write in the view of xen''s memory access. It happens in p2m_create_table function. Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> --- xen/arch/arm/domain.c | 14 +++ xen/arch/arm/domctl.c | 9 ++ xen/arch/arm/mm.c | 103 +++++++++++++++++++- xen/arch/arm/p2m.c | 206 ++++++++++++++++++++++++++++++++++++++++ xen/arch/arm/traps.c | 9 ++ xen/include/asm-arm/domain.h | 7 ++ xen/include/asm-arm/mm.h | 7 ++ xen/include/asm-arm/p2m.h | 4 + xen/include/asm-arm/processor.h | 2 + 9 files changed, 360 insertions(+), 1 deletion(-) diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index c0b5dd8..0a32301 100644 --- a/xen/arch/arm/domain.c +++ b/xen/arch/arm/domain.c @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) WRITE_SYSREG(hcr, HCR_EL2); isb(); + /* for dirty-page tracing + * XXX: how do we consider SMP case? + */ + if ( n->domain->arch.dirty.mode ) + restore_vlpt(n->domain); + /* This is could trigger an hardware interrupt from the virtual * timer. The interrupt needs to be injected into the guest. */ virt_timer_restore(n); @@ -509,11 +515,19 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) /* Default the virtual ID to match the physical */ d->arch.vpidr = boot_cpu_data.midr.bits; + /* init for dirty-page tracing */ + d->arch.dirty.count = 0; + d->arch.dirty.mode = 0; + spin_lock_init(&d->arch.dirty.lock); + d->arch.dirty.second_lvl_start = 0; d->arch.dirty.second_lvl_end = 0; d->arch.dirty.second_lvl[0] = NULL; d->arch.dirty.second_lvl[1] = NULL; + memset(d->arch.dirty.bitmap, 0, sizeof(d->arch.dirty.bitmap)); + d->arch.dirty.bitmap_pages = 0; + clear_page(d->shared_info); share_xen_page_with_guest( virt_to_page(d->shared_info), d, XENSHARE_writable); diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index cb38e59..eb74225 100644 --- a/xen/arch/arm/domctl.c +++ b/xen/arch/arm/domctl.c @@ -93,6 +93,15 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d, xfree(c.data); } break; + case XEN_DOMCTL_shadow_op: + { + domain_pause(d); + ret = dirty_mode_op(d, &domctl->u.shadow_op); + domain_unpause(d); + + copyback = 1; + } + break; default: return -EINVAL; diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index bf13993..d5a0a11 100644 --- a/xen/arch/arm/mm.c +++ b/xen/arch/arm/mm.c @@ -845,7 +845,6 @@ void destroy_xen_mappings(unsigned long v, unsigned long e) create_xen_entries(REMOVE, v, 0, (e - v) >> PAGE_SHIFT, 0); } -enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; static void set_pte_flags_on_range(const char *p, unsigned long l, enum mg mg) { lpae_t pte; @@ -1320,6 +1319,60 @@ int is_iomem_page(unsigned long mfn) * xen: arm: 64-bit guest support and domU FDT autogeneration * will be upstreamed. 
*/ + +static inline void mark_dirty_bitmap(struct domain *d, paddr_t addr) +{ + paddr_t ram_base = (paddr_t) GUEST_RAM_BASE; + int bit_index = PFN_DOWN(addr - ram_base); + int page_index = bit_index >> (PAGE_SHIFT + 3); + int bit_index_residual = bit_index & ((1ul << (PAGE_SHIFT + 3)) - 1); + + set_bit(bit_index_residual, d->arch.dirty.bitmap[page_index]); +} + +/* routine for dirty-page tracing + * + * On first write, it page faults, its entry is changed to read-write, + * and on retry the write succeeds. + * + * for locating p2m of the faulting entry, we use virtual-linear page table. + * returns zero if addr is not valid or dirty mode is not set + */ +int handle_page_fault(struct domain *d, paddr_t addr) +{ + + lpae_t *vlp2m_pte = 0; + paddr_t gma_start = 0; + paddr_t gma_end = 0; + + if ( !d->arch.dirty.mode ) return 0; + get_gma_start_end(d, &gma_start, &gma_end); + + /* Ensure that addr is inside guest''s RAM */ + if ( addr < gma_start || + addr > gma_end ) return 0; + + vlp2m_pte = get_vlpt_3lvl_pte(addr); + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 && + vlp2m_pte->p2m.avail == 0 /* reuse avail bit as read-only */ ) + { + lpae_t pte = *vlp2m_pte; + pte.p2m.write = 1; + write_pte(vlp2m_pte, pte); + flush_tlb_local(); + + /* only necessary to lock between get-dirty bitmap and mark dirty + * bitmap. If get-dirty bitmap happens immediately before this + * lock, the corresponding dirty-page would be marked at the next + * round of get-dirty bitmap */ + spin_lock(&d->arch.dirty.lock); + mark_dirty_bitmap(d, addr); + spin_unlock(&d->arch.dirty.lock); + } + + return 1; +} + void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end) { if ( start ) @@ -1440,6 +1493,54 @@ void cleanup_vlpt(struct domain *d) unmap_domain_page_global(d->arch.dirty.second_lvl[0]); unmap_domain_page_global(d->arch.dirty.second_lvl[1]); } + +int prepare_bitmap(struct domain *d) +{ + paddr_t gma_start = 0; + paddr_t gma_end = 0; + int nr_bytes; + int nr_pages; + int i; + + get_gma_start_end(d, &gma_start, &gma_end); + + nr_bytes = (PFN_DOWN(gma_end - gma_start) + 7) / 8; + nr_pages = (nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE; + + BUG_ON( nr_pages > MAX_DIRTY_BITMAP_PAGES ); + + for ( i = 0; i < nr_pages; ++i ) + { + struct page_info *page; + page = alloc_domheap_page(NULL, 0); + if ( page == NULL ) + goto cleanup_on_failure; + + d->arch.dirty.bitmap[i] = map_domain_page_global(__page_to_mfn(page)); + clear_page(d->arch.dirty.bitmap[i]); + } + + d->arch.dirty.bitmap_pages = nr_pages; + return 0; + +cleanup_on_failure: + nr_pages = i; + for ( i = 0; i < nr_pages; ++i ) + { + unmap_domain_page_global(d->arch.dirty.bitmap[i]); + } + return -ENOMEM; +} + +void cleanup_bitmap(struct domain *d) +{ + int i; + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) + { + unmap_domain_page_global(d->arch.dirty.bitmap[i]); + } +} + /* * Local variables: * mode: C diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index 2d09fef..b7dbf7d 100644 --- a/xen/arch/arm/p2m.c +++ b/xen/arch/arm/p2m.c @@ -6,6 +6,8 @@ #include <xen/bitops.h> #include <asm/flushtlb.h> #include <asm/gic.h> +#include <xen/guest_access.h> +#include <xen/pfn.h> void dump_p2m_lookup(struct domain *d, paddr_t addr) { @@ -113,6 +115,10 @@ static int p2m_create_table(struct domain *d, pte = mfn_to_p2m_entry(page_to_mfn(page), MATTR_MEM); + /* mark the write bit (page table''s case, ro bit) as 0 + * so, it is writable in case of vlpt access */ + pte.pt.ro = 0; + write_pte(entry, pte); return 0; @@ -408,6 +414,206 @@ unsigned long 
gmfn_to_mfn(struct domain *d, unsigned long gpfn) return p >> PAGE_SHIFT; } +/* Change types across all p2m entries in a domain */ +void p2m_change_entry_type_global(struct domain *d, enum mg nt) +{ + struct p2m_domain *p2m = &d->arch.p2m; + paddr_t ram_base; + int i1, i2, i3; + int first_index, second_index, third_index; + lpae_t *first = __map_domain_page(p2m->first_level); + lpae_t pte, *second = NULL, *third = NULL; + + get_gma_start_end(d, &ram_base, NULL); + + first_index = first_table_offset((uint64_t)ram_base); + second_index = second_table_offset((uint64_t)ram_base); + third_index = third_table_offset((uint64_t)ram_base); + + BUG_ON( !first && "Can''t map first level p2m." ); + + spin_lock(&p2m->lock); + + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) + { + lpae_walk_t first_pte = first[i1].walk; + if ( !first_pte.valid || !first_pte.table ) + goto out; + + second = map_domain_page(first_pte.base); + BUG_ON( !second && "Can''t map second level p2m."); + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) + { + lpae_walk_t second_pte = second[i2].walk; + + if ( !second_pte.valid || !second_pte.table ) + goto out; + + third = map_domain_page(second_pte.base); + BUG_ON( !third && "Can''t map third level p2m."); + + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) + { + lpae_walk_t third_pte = third[i3].walk; + if ( !third_pte.valid ) + goto out; + + pte = third[i3]; + if ( nt == mg_ro ) + { + if ( pte.p2m.write == 1 ) + { + pte.p2m.write = 0; + pte.p2m.avail = 0; + } + else + { + /* reuse avail bit as an indicator + * of ''actual'' read-only */ + pte.p2m.avail = 1; + } + } + else if ( nt == mg_rw ) + { + if ( pte.p2m.write == 0 && pte.p2m.avail == 0 ) + { + pte.p2m.write = 1; + } + } + write_pte(&third[i3], pte); + } + unmap_domain_page(third); + + third = NULL; + third_index = 0; + } + unmap_domain_page(second); + + second = NULL; + second_index = 0; + third_index = 0; + } + +out: + flush_tlb_all_local(); + if ( third ) unmap_domain_page(third); + if ( second ) unmap_domain_page(second); + if ( first ) unmap_domain_page(first); + + spin_unlock(&p2m->lock); +} + +/* Read a domain''s log-dirty bitmap and stats. + * If the operation is a CLEAN, clear the bitmap and stats. */ +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) +{ + int peek = 1; + int i; + int bitmap_size; + paddr_t gma_start, gma_end; + + /* this hypercall is called from domain 0, and we don''t know which guest''s + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */ + restore_vlpt(d); + + get_gma_start_end(d, &gma_start, &gma_end); + bitmap_size = (gma_end - gma_start) / 8; + + if ( guest_handle_is_null(sc->dirty_bitmap) ) + { + peek = 0; + } + else + { + spin_lock(&d->arch.dirty.lock); + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) + { + int j = 0; + uint8_t *bitmap; + copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE, + d->arch.dirty.bitmap[i], + bitmap_size < PAGE_SIZE ? 
bitmap_size : + PAGE_SIZE); + bitmap_size -= PAGE_SIZE; + + /* set p2m page table read-only */ + bitmap = d->arch.dirty.bitmap[i]; + while ((j = find_next_bit((const long unsigned int *)bitmap, + PAGE_SIZE*8, j)) < PAGE_SIZE*8) + { + lpae_t *vlpt; + paddr_t addr = gma_start + + (i << (2*PAGE_SHIFT+3)) + + (j << PAGE_SHIFT); + vlpt = get_vlpt_3lvl_pte(addr); + vlpt->p2m.write = 0; + j++; + } + } + + if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN ) + { + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) + { + clear_page(d->arch.dirty.bitmap[i]); + } + } + + spin_unlock(&d->arch.dirty.lock); + flush_tlb_local(); + } + + sc->stats.dirty_count = d->arch.dirty.count; + + return 0; +} + +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc) +{ + long ret = 0; + switch (sc->op) + { + case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY: + case XEN_DOMCTL_SHADOW_OP_OFF: + { + enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : mg_ro; + + d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 : 1; + p2m_change_entry_type_global(d, nt); + + if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF ) + { + cleanup_vlpt(d); + cleanup_bitmap(d); + } + else + { + if ( (ret = prepare_vlpt(d)) ) + return ret; + + if ( (ret = prepare_bitmap(d)) ) + { + /* in case of failure, we have to cleanup vlpt */ + cleanup_vlpt(d); + return ret; + } + } + } + break; + + case XEN_DOMCTL_SHADOW_OP_CLEAN: + case XEN_DOMCTL_SHADOW_OP_PEEK: + { + ret = log_dirty_op(d, sc); + } + break; + + default: + return -ENOSYS; + } + return ret; +} + /* * Local variables: * mode: C diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index 287dd7b..1a7ed11 100644 --- a/xen/arch/arm/traps.c +++ b/xen/arch/arm/traps.c @@ -1321,6 +1321,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, const char *msg; int rc, level = -1; mmio_info_t info; + int page_fault = ( (dabt.dfsc & FSC_MASK) =+ (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write ); if ( !check_conditional_instr(regs, hsr) ) { @@ -1342,6 +1344,13 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, if ( rc == -EFAULT ) goto bad_data_abort; + /* domU page fault handling for guest live migration */ + /* dabt.valid can be 0 here */ + if ( page_fault && handle_page_fault(current->domain, info.gpa) ) + { + /* Do not modify pc after page fault to repeat memory operation */ + return; + } /* XXX: Decode the instruction if ISS is not valid */ if ( !dabt.valid ) goto bad_data_abort; diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h index 4f366f1..180d924 100644 --- a/xen/include/asm-arm/domain.h +++ b/xen/include/asm-arm/domain.h @@ -114,9 +114,16 @@ struct arch_domain /* dirty-page tracing */ struct { +#define MAX_DIRTY_BITMAP_PAGES 64 /* support upto 8GB guest memory */ + spinlock_t lock; /* protect list: head, mvn_head */ + volatile int mode; /* 1 if dirty pages tracing enabled */ + volatile unsigned int count; /* dirty pages counter */ volatile int second_lvl_start; /* for context switch */ volatile int second_lvl_end; lpae_t *second_lvl[2]; /* copy of guest p2m''s first */ + /* dirty bitmap */ + uint8_t *bitmap[MAX_DIRTY_BITMAP_PAGES]; + int bitmap_pages; /* number of dirty bitmap pages */ } dirty; } __cacheline_aligned; diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h index a74e135..1ce7a4b 100644 --- a/xen/include/asm-arm/mm.h +++ b/xen/include/asm-arm/mm.h @@ -341,11 +341,18 @@ static inline void put_page_and_type(struct page_info *page) put_page(page); } +enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; + +/* routine for dirty-page 
tracing */ +int handle_page_fault(struct domain *d, paddr_t addr); void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end); int prepare_vlpt(struct domain *d); void cleanup_vlpt(struct domain *d); void restore_vlpt(struct domain *d); +int prepare_bitmap(struct domain *d); +void cleanup_bitmap(struct domain *d); + /* calculate the xen''s virtual address for accessing the leaf PTE of * a given address (GPA) */ static inline lpae_t * get_vlpt_3lvl_pte(paddr_t addr) diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h index c660820..dba9a7b 100644 --- a/xen/include/asm-arm/p2m.h +++ b/xen/include/asm-arm/p2m.h @@ -2,6 +2,7 @@ #define _XEN_P2M_H #include <xen/mm.h> +#include <public/domctl.h> struct domain; @@ -110,6 +111,9 @@ static inline int get_page_and_type(struct page_info *page, return rc; } +void p2m_change_entry_type_global(struct domain *d, enum mg nt); +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc); + #endif /* _XEN_P2M_H */ /* diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h index 5294421..fced6ad 100644 --- a/xen/include/asm-arm/processor.h +++ b/xen/include/asm-arm/processor.h @@ -399,6 +399,8 @@ union hsr { #define FSC_CPR (0x3a) /* Coprocossor Abort */ #define FSC_LL_MASK (0x03<<0) +#define FSC_MASK (0x3f) /* Fault status mask */ +#define FSC_3D_LEVEL (0x03) /* Third level fault*/ /* Time counter hypervisor control register */ #define CNTHCTL_PA (1u<<0) /* Kernel/user access to physical counter */ -- 1.8.1.2
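The dirty bitmap introduced here is indexed as in mark_dirty_bitmap(): the faulting address is converted to a bit number relative to GUEST_RAM_BASE, then split into a bitmap-page index and a bit offset within that page, so each 4KB bitmap page covers 32768 guest pages (128MB of RAM). A small standalone sketch of that computation follows; the faulting address is an arbitrary example.

/* Illustrative sketch of the bitmap indexing used by mark_dirty_bitmap(). */
#include <stdint.h>
#include <stdio.h>

#define GUEST_RAM_BASE 0x80000000ULL
#define PAGE_SHIFT     12

int main(void)
{
    uint64_t addr = 0x9ab41000ULL;                 /* faulting guest address */
    uint64_t bit_index  = (addr - GUEST_RAM_BASE) >> PAGE_SHIFT;
    uint64_t page_index = bit_index >> (PAGE_SHIFT + 3);
    uint64_t residual   = bit_index & ((1ULL << (PAGE_SHIFT + 3)) - 1);

    /* the patch then does set_bit(residual, d->arch.dirty.bitmap[page_index]) */
    printf("pfn offset %llu -> bit %llu of bitmap page %llu\n",
           (unsigned long long)bit_index, (unsigned long long)residual,
           (unsigned long long)page_index);
    return 0;
}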
Jaeyong Yoo
2013-Nov-08 07:50 UTC
[PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
From: Alexey Sokolov <sokolov.a@samsung.com> Implement for xl restore/save (which are also used for migrate) operation in xc_arm_migrate.c and make it compilable. The overall process of save is the following: 1) save guest parameters (i.e., memory map, console and store pfn, etc) 2) save memory (if it is live, perform dirty-page tracing) 3) save hvm states (i.e., gic, timer, vcpu etc) Singed-off-by: Alexey Sokolov <sokolov.a@samsung.com> --- config/arm32.mk | 1 + tools/libxc/Makefile | 6 +- tools/libxc/xc_arm_migrate.c | 712 +++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xc_dom_arm.c | 4 +- tools/misc/Makefile | 4 +- 5 files changed, 723 insertions(+), 4 deletions(-) create mode 100644 tools/libxc/xc_arm_migrate.c diff --git a/config/arm32.mk b/config/arm32.mk index aa79d22..01374c9 100644 --- a/config/arm32.mk +++ b/config/arm32.mk @@ -1,6 +1,7 @@ CONFIG_ARM := y CONFIG_ARM_32 := y CONFIG_ARM_$(XEN_OS) := y +CONFIG_MIGRATE := y CONFIG_XEN_INSTALL_SUFFIX : diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 4c64c15..05dfef4 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -42,8 +42,13 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c GUEST_SRCS-y : GUEST_SRCS-y += xg_private.c xc_suspend.c ifeq ($(CONFIG_MIGRATE),y) +ifeq ($(CONFIG_X86),y) GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c GUEST_SRCS-y += xc_offline_page.c xc_compression.c +endif +ifeq ($(CONFIG_ARM),y) +GUEST_SRCS-y += xc_arm_migrate.c +endif else GUEST_SRCS-y += xc_nomigrate.c endif @@ -63,7 +68,6 @@ $(patsubst %.c,%.opic,$(ELF_SRCS-y)): CFLAGS += -Wno-pointer-sign GUEST_SRCS-y += xc_dom_core.c xc_dom_boot.c GUEST_SRCS-y += xc_dom_elfloader.c GUEST_SRCS-$(CONFIG_X86) += xc_dom_bzimageloader.c -GUEST_SRCS-$(CONFIG_X86) += xc_dom_decompress_lz4.c GUEST_SRCS-$(CONFIG_ARM) += xc_dom_armzimageloader.c GUEST_SRCS-y += xc_dom_binloader.c GUEST_SRCS-y += xc_dom_compat_linux.c diff --git a/tools/libxc/xc_arm_migrate.c b/tools/libxc/xc_arm_migrate.c new file mode 100644 index 0000000..461e339 --- /dev/null +++ b/tools/libxc/xc_arm_migrate.c @@ -0,0 +1,712 @@ +/****************************************************************************** + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; + * version 2.1 of the License. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + * + * Copyright (c) 2013, Samsung Electronics + */ + +#include <inttypes.h> +#include <errno.h> +#include <xenctrl.h> +#include <xenguest.h> + +#include <unistd.h> +#include <xc_private.h> +#include <xc_dom.h> +#include "xc_bitops.h" +#include "xg_private.h" + +/* Guest RAM base */ +#define GUEST_RAM_BASE 0x80000000 +/* + * XXX: Use correct definition for RAM base when the following patch + * xen: arm: 64-bit guest support and domU FDT autogeneration + * will be upstreamed. 
+ */ + +#define DEF_MAX_ITERS 29 /* limit us to 30 times round loop */ +#define DEF_MAX_FACTOR 3 /* never send more than 3x p2m_size */ +#define DEF_MIN_DIRTY_PER_ITER 50 /* dirty page count to define last iter */ +#define DEF_PROGRESS_RATE 50 /* progress bar update rate */ + +/* Enable this macro for debug only: "static" migration instead of live */ +/* +#define DISABLE_LIVE_MIGRATION +*/ + +/* Enable this macro for debug only: additional debug info */ +/* +#define ARM_MIGRATE_VERBOSE +*/ + +/* + * Guest params to save: used HVM params, save flags, memory map + */ +typedef struct guest_params +{ + unsigned long console_pfn; + unsigned long store_pfn; + uint32_t flags; + xen_pfn_t start_gpfn; + xen_pfn_t max_gpfn; + uint32_t max_vcpu_id; +} guest_params_t; + +static int suspend_and_state(int (*suspend)(void*), void *data, + xc_interface *xch, int dom) +{ + xc_dominfo_t info; + if ( !(*suspend)(data) ) + { + ERROR("Suspend request failed"); + return -1; + } + + if ( (xc_domain_getinfo(xch, dom, 1, &info) != 1) || + !info.shutdown || (info.shutdown_reason != SHUTDOWN_suspend) ) + { + ERROR("Domain is not in suspended state after suspend attempt"); + return -1; + } + + return 0; +} + +static int write_exact_handled(xc_interface *xch, int fd, const void *data, + size_t size) +{ + if ( write_exact(fd, data, size) ) + { + ERROR("Write failed, check space"); + return -1; + } + return 0; +} + +/* ============ Memory ============= */ +static int save_memory(xc_interface *xch, int io_fd, uint32_t dom, + struct save_callbacks *callbacks, + uint32_t max_iters, uint32_t max_factor, + guest_params_t *params) +{ + int live = !!(params->flags & XCFLAGS_LIVE); + int debug = !!(params->flags & XCFLAGS_DEBUG); + xen_pfn_t i; + char reportbuf[80]; + int iter = 0; + int last_iter = !live; + int total_dirty_pages_num = 0; + int dirty_pages_on_prev_iter_num = 0; + int count = 0; + char *page = 0; + xen_pfn_t *busy_pages = 0; + int busy_pages_count = 0; + int busy_pages_max = 256; + + DECLARE_HYPERCALL_BUFFER(unsigned long, to_send); + + xen_pfn_t start = params->start_gpfn; + const xen_pfn_t end = params->max_gpfn; + const xen_pfn_t mem_size = end - start; + + if ( debug ) + { + IPRINTF("(save mem) start=%llx end=%llx!\n", start, end); + } + + if ( live ) + { + if ( xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, + NULL, 0, NULL, 0, NULL) < 0 ) + { + ERROR("Couldn''t enable log-dirty mode !\n"); + return -1; + } + + max_iters = max_iters ? : DEF_MAX_ITERS; + max_factor = max_factor ? 
: DEF_MAX_FACTOR; + + if ( debug ) + IPRINTF("Log-dirty mode enabled, max_iters=%d, max_factor=%d!\n", + max_iters, max_factor); + } + + to_send = xc_hypercall_buffer_alloc_pages(xch, to_send, + NRPAGES(bitmap_size(mem_size))); + if ( !to_send ) + { + ERROR("Couldn''t allocate to_send array!\n"); + return -1; + } + + /* send all pages on first iter */ + memset(to_send, 0xff, bitmap_size(mem_size)); + + for ( ; ; ) + { + int dirty_pages_on_current_iter_num = 0; + int frc; + iter++; + + snprintf(reportbuf, sizeof(reportbuf), + "Saving memory: iter %d (last sent %u)", + iter, dirty_pages_on_prev_iter_num); + + xc_report_progress_start(xch, reportbuf, mem_size); + + if ( (iter > 1 && + dirty_pages_on_prev_iter_num < DEF_MIN_DIRTY_PER_ITER) || + (iter == max_iters) || + (total_dirty_pages_num >= mem_size*max_factor) ) + { + if ( debug ) + IPRINTF("Last iteration"); + last_iter = 1; + } + + if ( last_iter ) + { + if ( suspend_and_state(callbacks->suspend, callbacks->data, + xch, dom) ) + { + ERROR("Domain appears not to have suspended"); + return -1; + } + } + if ( live && iter > 1 ) + { + frc = xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN, + HYPERCALL_BUFFER(to_send), mem_size, + NULL, 0, NULL); + if ( frc != mem_size ) + { + ERROR("Error peeking shadow bitmap"); + xc_hypercall_buffer_free_pages(xch, to_send, + NRPAGES(bitmap_size(mem_size))); + return -1; + } + } + + busy_pages = malloc(sizeof(xen_pfn_t) * busy_pages_max); + + for ( i = start; i < end; ++i ) + { + if ( test_bit(i - start, to_send) ) + { + page = xc_map_foreign_range(xch, dom, PAGE_SIZE, PROT_READ, i); + if ( !page ) + { + /* This page is mapped elsewhere, should be resent later */ + busy_pages[busy_pages_count] = i; + busy_pages_count++; + if ( busy_pages_count >= busy_pages_max ) + { + busy_pages_max += 256; + busy_pages = realloc(busy_pages, sizeof(xen_pfn_t) * + busy_pages_max); + } + continue; + } + + if ( write_exact_handled(xch, io_fd, &i, sizeof(i)) || + write_exact_handled(xch, io_fd, page, PAGE_SIZE) ) + { + munmap(page, PAGE_SIZE); + free(busy_pages); + return -1; + } + count++; + munmap(page, PAGE_SIZE); + + if ( (i % DEF_PROGRESS_RATE) == 0 ) + xc_report_progress_step(xch, i - start, mem_size); + dirty_pages_on_current_iter_num++; + } + } + + while ( busy_pages_count ) + { + /* Send busy pages */ + busy_pages_count--; + i = busy_pages[busy_pages_count]; + if ( test_bit(i - start, to_send) ) + { + page = xc_map_foreign_range(xch, dom, PAGE_SIZE,PROT_READ, i); + if ( !page ) + { + IPRINTF("WARNING: 2nd attempt to save page " + "busy failed pfn=%llx", i); + continue; + } + + if ( debug ) + { + IPRINTF("save mem: resend busy page %llx\n", i); + } + + if ( write_exact_handled(xch, io_fd, &i, sizeof(i)) || + write_exact_handled(xch, io_fd, page, PAGE_SIZE) ) + { + munmap(page, PAGE_SIZE); + free(busy_pages); + return -1; + } + count++; + munmap(page, PAGE_SIZE); + dirty_pages_on_current_iter_num++; + } + } + free(busy_pages); + + if ( debug ) + IPRINTF("Dirty pages=%d", dirty_pages_on_current_iter_num); + + xc_report_progress_step(xch, mem_size, mem_size); + + dirty_pages_on_prev_iter_num = dirty_pages_on_current_iter_num; + total_dirty_pages_num += dirty_pages_on_current_iter_num; + + if ( last_iter ) + { + xc_hypercall_buffer_free_pages(xch, to_send, + NRPAGES(bitmap_size(mem_size))); + if ( live ) + { + if ( xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, + NULL, 0, NULL, 0, NULL) < 0 ) + ERROR("Couldn''t disable log-dirty mode"); + } + break; + } + } + if ( debug ) + { + IPRINTF("save mem: pages count 
= %d\n", count); + } + + i = (xen_pfn_t) -1; /* end page marker */ + return write_exact_handled(xch, io_fd, &i, sizeof(i)); +} + +static int restore_memory(xc_interface *xch, int io_fd, uint32_t dom, + guest_params_t *params) +{ + xen_pfn_t end = params->max_gpfn; + xen_pfn_t gpfn; + int debug = !!(params->flags & XCFLAGS_DEBUG); + int count = 0; + char *page; + xen_pfn_t start = params->start_gpfn; + + /* TODO allocate several pages per call */ + for ( gpfn = start; gpfn < end; ++gpfn ) + { + if ( xc_domain_populate_physmap_exact(xch, dom, 1, 0, 0, &gpfn) ) + { + PERROR("Memory allocation for a new domain failed"); + return -1; + } + } + while ( 1 ) + { + + if ( read_exact(io_fd, &gpfn, sizeof(gpfn)) ) + { + PERROR("GPFN read failed during memory transfer, count=%d", count); + return -1; + } + if ( gpfn == (xen_pfn_t) -1 ) break; /* end page marker */ + + if ( gpfn < start || gpfn >= end ) + { + ERROR("GPFN %llx doesn''t belong to RAM address space, count=%d", + gpfn, count); + return -1; + } + page = xc_map_foreign_range(xch, dom, PAGE_SIZE, + PROT_READ | PROT_WRITE, gpfn); + if ( !page ) + { + PERROR("xc_map_foreign_range failed, pfn=%llx", gpfn); + return -1; + } + if ( read_exact(io_fd, page, PAGE_SIZE) ) + { + PERROR("Page data read failed during memory transfer, pfn=%llx", + gpfn); + return -1; + } + munmap(page, PAGE_SIZE); + count++; + } + + if ( debug ) + { + IPRINTF("Memory restored, pages count=%d", count); + } + return 0; +} + +/* ============ HVM context =========== */ +static int save_armhvm(xc_interface *xch, int io_fd, uint32_t dom, int debug) +{ + /* HVM: a buffer for holding HVM context */ + uint32_t hvm_buf_size = 0; + uint8_t *hvm_buf = NULL; + uint32_t rec_size; + int retval = -1; + + /* Need another buffer for HVM context */ + hvm_buf_size = xc_domain_hvm_getcontext(xch, dom, 0, 0); + if ( hvm_buf_size == -1 ) + { + ERROR("Couldn''t get HVM context size from Xen"); + goto out; + } + hvm_buf = malloc(hvm_buf_size); + + if ( !hvm_buf ) + { + ERROR("Couldn''t allocate memory for hvm buffer"); + goto out; + } + + /* Get HVM context from Xen and save it too */ + if ( (rec_size = xc_domain_hvm_getcontext(xch, dom, hvm_buf, + hvm_buf_size)) == -1 ) + { + ERROR("HVM:Could not get hvm buffer"); + goto out; + } + + if ( debug ) + IPRINTF("HVM save size %d %d", hvm_buf_size, rec_size); + + if ( write_exact_handled(xch, io_fd, &rec_size, sizeof(uint32_t)) ) + goto out; + + if ( write_exact_handled(xch, io_fd, hvm_buf, rec_size) ) + { + goto out; + } + retval = 0; + +out: + if ( hvm_buf ) + free (hvm_buf); + return retval; +} + +static int restore_armhvm(xc_interface *xch, int io_fd, + uint32_t dom, int debug) +{ + uint32_t rec_size; + uint32_t hvm_buf_size = 0; + uint8_t *hvm_buf = NULL; + int frc = 0; + int retval = -1; + + if ( read_exact(io_fd, &rec_size, sizeof(uint32_t)) ) + { + PERROR("Could not read HVM size"); + goto out; + } + + if ( !rec_size ) + { + ERROR("Zero HVM size"); + goto out; + } + + hvm_buf_size = xc_domain_hvm_getcontext(xch, dom, 0, 0); + if ( hvm_buf_size != rec_size ) + { + ERROR("HVM size for this domain is not the same as stored"); + } + + hvm_buf = malloc(hvm_buf_size); + if ( !hvm_buf ) + { + ERROR("Couldn''t allocate memory"); + goto out; + } + + if ( read_exact(io_fd, hvm_buf, hvm_buf_size) ) + { + PERROR("Could not read HVM context"); + goto out; + } + + frc = xc_domain_hvm_setcontext(xch, dom, hvm_buf, hvm_buf_size); + if ( frc ) + { + ERROR("error setting the HVM context"); + goto out; + } + retval = 0; + + if ( debug ) + { + IPRINTF("HVM 
restore size %d %d", hvm_buf_size, rec_size); + } +out: + if ( hvm_buf ) + free (hvm_buf); + return retval; +} + +/* ================= Console & Xenstore & Memory map =========== */ +static int save_guest_params(xc_interface *xch, int io_fd, + uint32_t dom, uint32_t flags, + guest_params_t *params) +{ + size_t sz = sizeof(guest_params_t); + xc_dominfo_t dom_info; + + params->max_gpfn = xc_domain_maximum_gpfn(xch, dom); + params->start_gpfn = (GUEST_RAM_BASE >> PAGE_SHIFT); + + if ( flags & XCFLAGS_DEBUG ) + { + IPRINTF("Guest param save size: %d ", sz); + } + + if ( xc_get_hvm_param(xch, dom, HVM_PARAM_CONSOLE_PFN, + ¶ms->console_pfn) ) + { + ERROR("Can''t get console gpfn"); + return -1; + } + + if ( xc_get_hvm_param(xch, dom, HVM_PARAM_STORE_PFN, ¶ms->store_pfn) ) + { + ERROR("Can''t get store gpfn"); + return -1; + } + + if ( xc_domain_getinfo(xch, dom, 1, &dom_info ) < 0) + { + ERROR("Can''t get domain info for dom %d", dom); + return -1; + } + params->max_vcpu_id = dom_info.max_vcpu_id; + + params->flags = flags; + + if ( write_exact_handled(xch, io_fd, params, sz) ) + { + return -1; + } + + return 0; +} + +static int restore_guest_params(xc_interface *xch, int io_fd, + uint32_t dom, guest_params_t *params) +{ + size_t sz = sizeof(guest_params_t); + xen_pfn_t nr_pfns; + unsigned int maxmemkb; + + if ( read_exact(io_fd, params, sizeof(guest_params_t)) ) + { + PERROR("Can''t read guest params"); + return -1; + } + + nr_pfns = params->max_gpfn - params->start_gpfn; + maxmemkb = (unsigned int) nr_pfns << (PAGE_SHIFT - 10); + + if ( params->flags & XCFLAGS_DEBUG ) + { + IPRINTF("Guest param restore size: %d ", sz); + IPRINTF("Guest memory size: %d MB", maxmemkb >> 10); + } + + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) + { + ERROR("Can''t set memory map"); + return -1; + } + + /* Set max. 
number of vcpus as max_vcpu_id + 1 */ + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) ) + { + ERROR("Can''t set max vcpu number for domain"); + return -1; + } + + return 0; +} + +static int set_guest_params(xc_interface *xch, int io_fd, uint32_t dom, + guest_params_t *params, unsigned int console_evtchn, + domid_t console_domid, unsigned int store_evtchn, + domid_t store_domid) +{ + int rc = 0; + + if ( (rc = xc_clear_domain_page(xch, dom, params->console_pfn)) ) + { + ERROR("Can''t clear console page"); + return rc; + } + + if ( (rc = xc_clear_domain_page(xch, dom, params->store_pfn)) ) + { + ERROR("Can''t clear xenstore page"); + return rc; + } + + if ( (rc = xc_dom_gnttab_hvm_seed(xch, dom, params->console_pfn, + params->store_pfn, console_domid, + store_domid)) ) + { + ERROR("Can''t grant console and xenstore pages"); + return rc; + } + + if ( (rc = xc_set_hvm_param(xch, dom, HVM_PARAM_CONSOLE_PFN, + params->console_pfn)) ) + { + ERROR("Can''t set console gpfn"); + return rc; + } + + if ( (rc = xc_set_hvm_param(xch, dom, HVM_PARAM_STORE_PFN, + params->store_pfn)) ) + { + ERROR("Can''t set xenstore gpfn"); + return rc; + } + + if ( (rc = xc_set_hvm_param(xch, dom, HVM_PARAM_CONSOLE_EVTCHN, + console_evtchn)) ) + { + ERROR("Can''t set console event channel"); + return rc; + } + + if ( (rc = xc_set_hvm_param(xch, dom, HVM_PARAM_STORE_EVTCHN, + store_evtchn)) ) + { + ERROR("Can''t set xenstore event channel"); + return rc; + } + return 0; +} + +/* ================== Main ============== */ +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, + uint32_t max_iters, uint32_t max_factor, uint32_t flags, + struct save_callbacks *callbacks, int hvm, + unsigned long vm_generationid_addr) +{ + int debug; + guest_params_t params; + +#ifdef ARM_MIGRATE_VERBOSE + flags |= XCFLAGS_DEBUG; +#endif + +#ifdef DISABLE_LIVE_MIGRATION + flags &= ~(XCFLAGS_LIVE); +#endif + + debug = !!(flags & XCFLAGS_DEBUG); + if ( save_guest_params(xch, io_fd, dom, flags, ¶ms) ) + { + ERROR("Can''t save guest params"); + return -1; + } + + if ( save_memory(xch, io_fd, dom, callbacks, max_iters, + max_factor, ¶ms) ) + { + ERROR("Memory not saved"); + return -1; + } + + if ( save_armhvm(xch, io_fd, dom, debug) ) + { + ERROR("HVM not saved"); + return -1; + } + + if ( debug ) + { + IPRINTF("Domain %d saved", dom); + } + return 0; +} + +int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, + unsigned int store_evtchn, unsigned long *store_gpfn, + domid_t store_domid, unsigned int console_evtchn, + unsigned long *console_gpfn, domid_t console_domid, + unsigned int hvm, unsigned int pae, int superpages, + int no_incr_generationid, int checkpointed_stream, + unsigned long *vm_generationid_addr, + struct restore_callbacks *callbacks) +{ + guest_params_t params; + int debug = 1; + + if ( restore_guest_params(xch, io_fd, dom, ¶ms) ) + { + ERROR("Can''t restore guest params"); + return -1; + } + debug = !!(params.flags & XCFLAGS_DEBUG); + + if ( restore_memory(xch, io_fd, dom, ¶ms) ) + { + ERROR("Can''t restore memory"); + return -1; + } + if ( set_guest_params(xch, io_fd, dom, ¶ms, + console_evtchn, console_domid, + store_evtchn, store_domid) ) + { + ERROR("Can''t setup guest params"); + return -1; + } + + /* Setup console and store PFNs to caller */ + *console_gpfn = params.console_pfn; + *store_gpfn = params.store_pfn; + + if ( restore_armhvm(xch, io_fd, dom, debug) ) + { + ERROR("HVM not restored"); + return -1; + } + + if ( debug ) + { + IPRINTF("Domain %d restored", dom); + } + + return 0; +} 
+ +/* + * Local variables: + * mode: C + * c-set-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/tools/libxc/xc_dom_arm.c b/tools/libxc/xc_dom_arm.c index df59ffb..ebeeb41 100644 --- a/tools/libxc/xc_dom_arm.c +++ b/tools/libxc/xc_dom_arm.c @@ -191,7 +191,9 @@ int arch_setup_meminit(struct xc_dom_image *dom) 0, 0, &dom->p2m_host[i]); } - return 0; + return xc_domain_setmaxmem(dom->xch, dom->guest_domid, + (dom->total_pages + NR_MAGIC_PAGES) + << (PAGE_SHIFT - 10)); } int arch_setup_bootearly(struct xc_dom_image *dom) diff --git a/tools/misc/Makefile b/tools/misc/Makefile index 17aeda5..0824100 100644 --- a/tools/misc/Makefile +++ b/tools/misc/Makefile @@ -11,7 +11,7 @@ HDRS = $(wildcard *.h) TARGETS-y := xenperf xenpm xen-tmem-list-parse gtraceview gtracestat xenlockprof xenwatchdogd xencov TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump -TARGETS-$(CONFIG_MIGRATE) += xen-hptool +TARGETS-$(CONFIG_X86) += xen-hptool TARGETS := $(TARGETS-y) SUBDIRS := $(SUBDIRS-y) @@ -23,7 +23,7 @@ INSTALL_BIN := $(INSTALL_BIN-y) INSTALL_SBIN-y := xen-bugtool xen-python-path xenperf xenpm xen-tmem-list-parse gtraceview \ gtracestat xenlockprof xenwatchdogd xen-ringwatch xencov INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump -INSTALL_SBIN-$(CONFIG_MIGRATE) += xen-hptool +INSTALL_SBIN-$(CONFIG_X86) += xen-hptool INSTALL_SBIN := $(INSTALL_SBIN-y) INSTALL_PRIVBIN-y := xenpvnetboot -- 1.8.1.2
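For readers trying to follow the stream handling in xc_arm_migrate.c, the layout that xc_domain_save() ends up writing (and xc_domain_restore() expects back) can be sketched as below. The struct and comment names are purely descriptive and do not appear in the patch; the code writes the pfn and the page data with two separate write_exact() calls, so nothing here is a literal wire struct, and the 64-bit width of xen_pfn_t on ARM is assumed.

#include <stdint.h>

/* Descriptive sketch only -- these identifiers are not from the patch. */
struct page_record {
    uint64_t gpfn;        /* xen_pfn_t; a value of (xen_pfn_t)-1
                           * terminates the page stream */
    uint8_t  data[4096];  /* PAGE_SIZE bytes of guest memory */
};

/*
 * Stream order, as implemented above:
 *   1. guest_params_t         (console/store pfn, flags, gpfn range, vcpus)
 *   2. page records           (pages may be re-sent on later live iterations)
 *   3. xen_pfn_t end marker   ((xen_pfn_t)-1)
 *   4. uint32_t HVM context size, then that many bytes as returned by
 *      xc_domain_hvm_getcontext()
 */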
Ian Campbell
2013-Nov-12 15:15 UTC
Re: [PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: If these get resent again please can you CC the ARM maintainers. $ git grep -A5 ARM MAINTAINERS MAINTAINERS:ARM (W/ VIRTUALISATION EXTENSIONS) ARCHITECTURE MAINTAINERS-M: Ian Campbell <ian.campbell@citrix.com> MAINTAINERS-M: Stefano Stabellini <stefano.stabellini@citrix.com> MAINTAINERS-M: Tim Deegan <tim@xen.org> MAINTAINERS-S: Supported MAINTAINERS-L: xen-devel@lists.xen.org> From: Evgeny Fedotov <e.fedotov@samsung.com> > > Implement save/restore of hvm context hypercall. > In hvm context save/restore, we save gic, timer and vfp registers. > > Changes from v4: Save vcpu registers within hvm context, and purge > the save-vcpu-register patch.The inter-version changelog should go after the main commit message and the S-o-b separated by a "---" on a line by itself. This way it is not included in the final commit. [...]> +static int vgic_irq_rank_save(struct vgic_rank *ext, > + struct vgic_irq_rank *rank) > +{ > + spin_lock(&rank->lock);This is probably wise, but the domain must be paused at this point, right?> + /* Some of VGIC registers are not used yet, it is for a future usage */ > + /* IENABLE, IACTIVE, IPEND, PENDSGI registers */ > + ext->ienable = rank->ienable; > + ext->iactive = rank->iactive; > + ext->ipend = rank->ipend; > + ext->pendsgi = rank->pendsgi; > + /* ICFG */ > + ext->icfg[0] = rank->icfg[0]; > + ext->icfg[1] = rank->icfg[1]; > + /* IPRIORITY */ > + if ( sizeof(rank->ipriority) != sizeof (ext->ipriority) ) > + { > + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check ipriority dumping space\n");What makes ipriority (and itargets below) special enough to warrant such a check compared with all the other fields? Could these checks be BUILD_BUG_ON?> + return -EINVAL; > + } > + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); > + /* ITARGETS */ > + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) > + { > + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); > + return -EINVAL; > + } > + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); > + spin_unlock(&rank->lock); > + return 0; > +} > + > +static int gic_save(struct domain *d, hvm_domain_context_t *h) > +{ > + struct hvm_hw_gic ctxt; > + struct vcpu *v; > + > + /* Save the state of GICs */ > + for_each_vcpu( d, v )Where is the GICD state saved then?> + { > + ctxt.gic_hcr = v->arch.gic_hcr; > + ctxt.gic_vmcr = v->arch.gic_vmcr; > + ctxt.gic_apr = v->arch.gic_apr; > + > + /* Save list registers and masks */ > + /* (it is not necessary to save/restore them, but LR state can have > + * influence on downtime after Live Migration (to be tested)What does this mean? I''m in two minds about whether LR counts as architectural state, since it is hypervisor side and not guest facing. 
I''d have thought that the LR content could be reconstructed from the pending and active bitmasks.> + */ > + if ( sizeof(v->arch.gic_lr) > sizeof (ctxt.gic_lr) ) > + { > + dprintk(XENLOG_G_ERR, "hvm_hw_gic: increase LR dumping space\n"); > + return -EINVAL; > + } > + memcpy(ctxt.gic_lr, v->arch.gic_lr, sizeof(v->arch.gic_lr)); > + ctxt.lr_mask = v->arch.lr_mask; > + ctxt.event_mask = v->arch.event_mask;These last two in particular make me think that saving the LRs is wrong.> + /* Save PPI states (per-CPU) */ > + /* It is necessary if SMP enabled */ > + if ( vgic_irq_rank_save(&ctxt.ppi_state, &v->arch.vgic.private_irqs) ) > + return 1; > + > + if ( hvm_save_entry(GIC, v->vcpu_id, h, &ctxt) != 0 ) > + return 1; > + } > + return 0; > +} > + > +static int gic_load(struct domain *d, hvm_domain_context_t *h) > +{ > + int vcpuid; > + struct hvm_hw_gic ctxt; > + struct vcpu *v; > [...] > + return 0; > +} > + > +HVM_REGISTER_SAVE_RESTORE(GIC, gic_save, gic_load, 1, HVMSR_PER_VCPU); > + > [...]> +#ifdef CONFIG_ARM_32 > + ctxt.mair0 = v->arch.mair0; > + ctxt.mair1 = v->arch.mair1; > +#else > + ctxt.mair0 = v->arch.mair; > +#endif > + /* Control Registers */ > + ctxt.actlr = v->arch.actlr; > + ctxt.sctlr = v->arch.sctlr; > + ctxt.cpacr = v->arch.cpacr; > [...]> + ctxt.x0 = c.x0;Would it make sense to include a field of type vcpu_guest_core_regs in the save struct instead of this big list? [...]> + #ifdef CONFIG_ARM_32#ifdef in the first columns please.> + ctxt.spsr_fiq = c.spsr_fiq; > + ctxt.spsr_irq = c.spsr_irq; > + ctxt.spsr_und = c.spsr_und; > + ctxt.spsr_abt = c.spsr_abt;and don''t indent these as if they were inside a block You had this right when you saved the MAIR registers above.> + #endif > + #ifdef CONFIG_ARM_64or #else if you prefer.> + ctxt.sp_el0 = c.sp_el0; > + ctxt.sp_el1 = c.sp_el1; > + ctxt.elr_el1 = c.elr_el1; > + #endif > + > + /* check VFP state size before dumping */ > + if ( sizeof(v->arch.vfp) > sizeof (ctxt.vfp) ) > + { > + dprintk(XENLOG_G_ERR, "hvm_hw_cpu: increase VFP dumping space\n"); > + return -EINVAL; > + } > + memcpy((void*) &ctxt.vfp, (void*) &v->arch.vfp, sizeof(v->arch.vfp)); > + > + ctxt.pause_flags = v->pause_flags;x86 has at the top of the loop: /* We don''t need to save state for a vcpu that is down; the restore * code will leave it down if there is nothing saved. */ if ( test_bit(_VPF_down, &v->pause_flags) ) continue; which I think is preferable to this.> diff --git a/xen/include/asm-arm/hvm/support.h b/xen/include/asm-arm/hvm/support.h > new file mode 100644 > index 0000000..8311f2f > --- /dev/null > +++ b/xen/include/asm-arm/hvm/support.h > @@ -0,0 +1,29 @@ > +/* > + * support.h: HVM support routines used by ARMv7 VE.Not just on Vexpress nor just on v7 I expect?> + * > + * Copyright (c) 2012, Citrix SystemsReally?> + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + * > + * You should have received a copy of the GNU General Public License along with > + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple > + * Place - Suite 330, Boston, MA 02111-1307 USA. 
> + */ > + > +#ifndef __ASM_ARM_HVM_SUPPORT_H__ > +#define __ASM_ARM_HVM_SUPPORT_H__ > + > +#include <xen/types.h> > +#include <public/hvm/ioreq.h> > +#include <xen/sched.h> > +#include <xen/hvm/save.h> > +#include <asm/processor.h>Just some #includes here? Ian,
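To make the BUILD_BUG_ON suggestion concrete: with compile-time checks, the ipriority/itargets handling in vgic_irq_rank_save() could shrink to something like the sketch below. The helper name is invented here; the struct and field names are the patch's. A layout mismatch between the save format and the in-hypervisor vGIC state would then break the build instead of returning -EINVAL at save time.

static void vgic_rank_copy_arrays(struct vgic_rank *ext,
                                  const struct vgic_irq_rank *rank)
{
    /* sizeof() of the members is a compile-time constant, so these can
     * be build-time assertions rather than runtime checks. */
    BUILD_BUG_ON(sizeof(rank->ipriority) != sizeof(ext->ipriority));
    BUILD_BUG_ON(sizeof(rank->itargets)  != sizeof(ext->itargets));

    memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority));
    memcpy(ext->itargets,  rank->itargets,  sizeof(rank->itargets));
}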
Ian Campbell
2013-Nov-12 15:21 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:> From: Evgeny Fedotov <e.fedotov@samsung.com> > > By using the memory map info in arch_domain (from set_memory_map hypercall) > implement get_maximum_gpfn hypercall. > > Changes from v4: Use GUEST_RAM_BASE as the start physical address of guest > RAM. And, purge set-memory-map patch > > Singed-off-by: Evgeny Fedotov <e.fedotov@samsung.com> > --- > xen/arch/arm/mm.c | 22 +++++++++++++++++++++- > xen/include/asm-arm/mm.h | 2 ++ > 2 files changed, 23 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c > index 123280e..3801f07 100644 > --- a/xen/arch/arm/mm.c > +++ b/xen/arch/arm/mm.c > @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type) > > unsigned long domain_get_maximum_gpfn(struct domain *d)s/unsigned long/xen_pfn_t/ I think. Urk, which will break the ABI for this hypercall. That''s something of a conundrum :-/ It is a domctl so we are at liberty to change it, probably to taking a xen_pfn_t * to fill in instead of returning the value. I''m in two minds about whether we should do so now or postpone it (which risks forgetting and having an obscure bug somewhere down the line).> { > - return -ENOSYS; > + paddr_t end; > + > + get_gma_start_end(d, NULL, &end);I suppose this function is useful in a future patch too, otherwise return (GUEST_RAM_BASE>>PAGE_SHIFT)+d->max_pages would suffice. Ian.
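For reference, the "take a xen_pfn_t to fill in" variant mentioned above might look like the sketch below. The interface change, the domctl plumbing (not shown), and the final shift are assumptions -- the quoted hunk ends before the function's return statement -- and only the get_gma_start_end() call is taken from the patch.

/* Sketch: avoid truncating the GPFN through the unsigned long return
 * value by letting the caller pass in a xen_pfn_t to fill. */
void domain_get_maximum_gpfn(struct domain *d, xen_pfn_t *gpfn)
{
    paddr_t end;

    get_gma_start_end(d, NULL, &end);
    *gpfn = (xen_pfn_t)(end >> PAGE_SHIFT);
}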
Ian Campbell
2013-Nov-12 15:24 UTC
Re: [PATCH RESEND v5 3/6] xen/arm: Implement modify_returncode
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:> From: Alexey Sokolov <sokolov.a@samsung.com> > > Making sched_op in do_suspend (driver/xen/manage.c) returns 0 on the success of suspend. > > Singed-off-by: Alexey Sokolov <sokolov.a@samsung.com>Acked-by: Ian Campbell <ian.campbell@citrix.com> Although if you wanted to...> +#elif defined(__arm__)make this "... || defined(__aarch64__)" and ...> + ctxt.c.user_regs.r0_usr = 1;Use x0 here instead (which is a 64-bit alias of r0_usr, so good for both 32 and 64 bit). Then it would be ready made for 64-bit too in the future. Ian.
Ian Campbell
2013-Nov-12 15:58 UTC
Re: [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:> Implement vlpt (virtual-linear page table) for fast accessing of 3rd PTE of guest p2m. > When creating a mapping for vlpt, just copy the 1st level PTE of guest p2m to the xen''s > 2nd level PTE. Then the mapping becomes the following: > xen''s 1st PTE --> > xen''s 2nd PTE (which is the same as 1st PTE of guest p2m) --> > guest p2m''s 2nd PTE --> > guest p2m''s 3rd PTE (the memory contents where the vlpt points) > For more info about vlpt, see: > http://www.technovelty.org/linux/virtual-linear-page-table.html > > This function is used in dirty-page tracing: when domU write-fault is trapped by xen, > xen can immediately locate the 3rd PTE of guest p2m. > The following link shows the performance comparison for handling a dirty-page between > vlpt and typical page table walking. > http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html > > Changes from v4: > 1. In the restoring vlpt, use __foo variant without barriers for write_pte and > flush_xen_data_tlb_range_va. > 2. Support two consecutive pages for guest''s first level page table. > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > --- > xen/arch/arm/domain.c | 5 ++ > xen/arch/arm/mm.c | 112 +++++++++++++++++++++++++++++++++++++++ > xen/include/asm-arm/arm32/page.h | 41 ++++++++++---- > xen/include/asm-arm/config.h | 5 ++ > xen/include/asm-arm/domain.h | 7 +++ > xen/include/asm-arm/mm.h | 15 ++++++ > 6 files changed, 174 insertions(+), 11 deletions(-) > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c > index cb0424d..c0b5dd8 100644 > --- a/xen/arch/arm/domain.c > +++ b/xen/arch/arm/domain.c > @@ -509,6 +509,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > /* Default the virtual ID to match the physical */ > d->arch.vpidr = boot_cpu_data.midr.bits; > > + d->arch.dirty.second_lvl_start = 0; > + d->arch.dirty.second_lvl_end = 0; > + d->arch.dirty.second_lvl[0] = NULL; > + d->arch.dirty.second_lvl[1] = NULL; > + > clear_page(d->shared_info); > share_xen_page_with_guest( > virt_to_page(d->shared_info), d, XENSHARE_writable); > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c > index 3801f07..bf13993 100644 > --- a/xen/arch/arm/mm.c > +++ b/xen/arch/arm/mm.c > @@ -1328,6 +1328,118 @@ void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end) > *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << PAGE_SHIFT); > } > > +/* flush the vlpt area */ > +void flush_vlpt(struct domain *d) > +{ > + int flush_size; > + flush_size = (d->arch.dirty.second_lvl_end - > + d->arch.dirty.second_lvl_start) << SECOND_SHIFT; > + /* flushing the 3rd level mapping */ > + flush_xen_data_tlb_range_va(d->arch.dirty.second_lvl_start << SECOND_SHIFT, > + flush_size);All these << SECOND_SHIFT, can they be put into a macro VLPT_OFFSET(guest_vaddr) perhaps (assuming that''s the right semantics)> +} > + > +/* restore the xen page table for vlpt mapping for domain d */ > +void restore_vlpt(struct domain *d)I guess this is called on context switch? 
Why not in this patch?> +{ > + int i; > + dsb(); > + for ( i = d->arch.dirty.second_lvl_start; > + i < d->arch.dirty.second_lvl_end; > + ++i ) > + { > + int k = i % LPAE_ENTRIES; > + int l = i / LPAE_ENTRIES; > + > + if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits ) > + { > + __write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]); > + __flush_xen_data_tlb_range_va(i << SECOND_SHIFT, > + 1 << SECOND_SHIFT); > + } > + } > + dsb(); > + isb(); > +} > + > +/* setting up the xen page table for vlpt mapping for domain d */ > +int prepare_vlpt(struct domain *d) > +{ > + int xen_second_linear_base; > + int gp2m_start_index, gp2m_end_index; > + struct p2m_domain *p2m = &d->arch.p2m; > + struct page_info *second_lvl_page; > + paddr_t gma_start = 0; > + paddr_t gma_end = 0; > + lpae_t *first[2]; > + int i; > + uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START; > + > + get_gma_start_end(d, &gma_start, &gma_end); > + required = (gma_end - gma_start) >> LPAE_SHIFT; > + > + > + if ( required > avail )avail is the number of bytes of virtual address space in the linear p2m area. required is the number of pages which the guest has. Does this comparison correctly account for each page being represented by an 8-byte lpae_t entry?> + { > + dprintk(XENLOG_ERR, "Available VLPT is small for domU guest" > + "(avail: %llx, required: %llx)\n", > + avail, required); > + return -ENOMEM; > + } > + > + xen_second_linear_base = second_linear_offset(VIRT_LIN_P2M_START); > + > + gp2m_start_index = gma_start >> FIRST_SHIFT; > + gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1; > + > + if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 ) > + { > + dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU"); > + return -ENOMEM; > + } > + > + second_lvl_page = alloc_domheap_pages(NULL, 1, 0);There''d be no harm in allocating the two pages separately I think, and avoiding the possiblilty of an order 1 allocation failing. But if you do allocate them contiguously then d->arch.dirty.second_lvl can be a simple lpae_t * and you can just access offsets 0..1023 without worrying about managing the array of two pointers. As it stands I think you have the worst of both worlds. We could also just consider allocating all firstlevel p2m pages from the xenheap instead of the domheap.> + if ( second_lvl_page == NULL ) > + return -ENOMEM; > + > + /* First level p2m is 2 consecutive pages */ > + d->arch.dirty.second_lvl[0] = map_domain_page_global( > + page_to_mfn(second_lvl_page) ); > + d->arch.dirty.second_lvl[1] = map_domain_page_global( > + page_to_mfn(second_lvl_page+1) ); > + > + first[0] = __map_domain_page(p2m->first_level); > + first[1] = __map_domain_page(p2m->first_level+1); > + for ( i = gp2m_start_index; i < gp2m_end_index; ++i )Can''t this just be a loop over 0..1023 and avoid all this modular arithmetic:> + { > + int k = i % LPAE_ENTRIES; > + int l = i / LPAE_ENTRIES; > + int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES; > + int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES; > + > + write_pte(&xen_second[xen_second_linear_base+i], first[l][k]); > + > + /* we copy the mapping into domain''s structure as a reference > + * in case of the context switch (used in restore_vlpt) */Do you keep this up to date when changing the p2m? In fact, why aren''t you just mapping the actual first level pages? 
Why does it have to be a copy?> + d->arch.dirty.second_lvl[l2][k2] = first[l][k]; > + } > + unmap_domain_page(first[0]); > + unmap_domain_page(first[1]); > + > + /* storing the start and end index */ > + d->arch.dirty.second_lvl_start = xen_second_linear_base + gp2m_start_index; > + d->arch.dirty.second_lvl_end = xen_second_linear_base + gp2m_end_index; > + > + flush_vlpt(d); > + return 0; > +} > + > +void cleanup_vlpt(struct domain *d) > +{ > + /* First level p2m is 2 consecutive pages */ > + unmap_domain_page_global(d->arch.dirty.second_lvl[0]); > + unmap_domain_page_global(d->arch.dirty.second_lvl[1]); > +} > /* > * Local variables: > * mode: C > diff --git a/xen/include/asm-arm/arm32/page.h b/xen/include/asm-arm/arm32/page.h > index cf12a89..0a4e115 100644 > --- a/xen/include/asm-arm/arm32/page.h > +++ b/xen/include/asm-arm/arm32/page.h > @@ -5,20 +5,28 @@ > > /* Write a pagetable entry. > * > - * If the table entry is changing a text mapping, it is responsibility > - * of the caller to issue an ISB after write_pte. > + * All necessary barriers are responsibility of the caller > */ > -static inline void write_pte(lpae_t *p, lpae_t pte) > +static inline void __write_pte(lpae_t *p, lpae_t pte) > { > asm volatile ( > - /* Ensure any writes have completed with the old mappings. */ > - "dsb;" > /* Safely write the entry (STRD is atomic on CPUs that support LPAE) */ > "strd %0, %H0, [%1];" > - "dsb;" > : : "r" (pte.bits), "r" (p) : "memory"); > } > > +/* Write a pagetable entry with dsb barriers. > + * > + * If the table entry is changing a text mapping, it is responsibility > + * of the caller to issue an ISB after write_pte. > + */ > +static inline void write_pte(lpae_t *p, lpae_t pte) > +{ > + dsb();Please retain the comments from the original version.> + __write_pte(p, pte); > + dsb(); > +} > + > /* Inline ASM to flush dcache on register R (may be an inline asm operand) */ > #define __flush_xen_dcache_one(R) STORE_CP32(R, DCCMVAC) > > @@ -57,18 +65,28 @@ static inline void flush_xen_data_tlb(void) > } > > /* > - * Flush a range of VA''s hypervisor mappings from the data TLB. This is not > - * sufficient when changing code mappings or for self modifying code. > + * Flush a range of VA''s hypervisor mappings from the data TLB. > + * All necessary barriers are responsibility of the caller > */ > -static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) > +static inline void __flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) > { > unsigned long end = va + size; > - dsb(); /* Ensure preceding are visible */ > while ( va < end ) { > asm volatile(STORE_CP32(0, TLBIMVAH) > : : "r" (va) : "memory"); > va += PAGE_SIZE; > } > +} > + > +/* > + * Flush a range of VA''s hypervisor mappings from the data TLB with barriers. > + * This barrier is not sufficient when changing code mappings or for self > + * modifying code. 
> + */ > +static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size) > +{ > + dsb(); /* Ensure preceding are visible */ > + __flush_xen_data_tlb_range_va(va, size); > dsb(); /* Ensure completion of the TLB flush */ > isb(); > } > diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h > index 5b7b1a8..15ad56d 100644 > --- a/xen/include/asm-arm/config.h > +++ b/xen/include/asm-arm/config.h > @@ -87,6 +87,7 @@ > * 0 - 8M <COMMON> > * > * 32M - 128M Frametable: 24 bytes per page for 16GB of RAM > + * 128M - 256M Virtual-linear mapping to P2M table > * 256M - 1G VMAP: ioremap and early_ioremap use this virtual address > * space > * > @@ -124,7 +125,9 @@ > #define CONFIG_SEPARATE_XENHEAP 1 > > #define FRAMETABLE_VIRT_START _AT(vaddr_t,0x02000000) > +#define VIRT_LIN_P2M_START _AT(vaddr_t,0x08000000) > #define VMAP_VIRT_START _AT(vaddr_t,0x10000000) > +#define VIRT_LIN_P2M_END VMAP_VIRT_START > #define XENHEAP_VIRT_START _AT(vaddr_t,0x40000000) > #define XENHEAP_VIRT_END _AT(vaddr_t,0x7fffffff) > #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) > @@ -157,6 +160,8 @@ > > #define HYPERVISOR_VIRT_END DIRECTMAP_VIRT_END > > +/*TODO (ARM_64): define VIRT_LIN_P2M_START VIRT_LIN_P2M_END */ > + > #endif > > /* Fixmap slots */ > diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h > index 67bfbbc..4f366f1 100644 > --- a/xen/include/asm-arm/domain.h > +++ b/xen/include/asm-arm/domain.h > @@ -112,6 +112,13 @@ struct arch_domain > spinlock_t lock; > } vuart; > > + /* dirty-page tracing */ > + struct { > + volatile int second_lvl_start; /* for context switch */ > + volatile int second_lvl_end;Are these virtual addresses, or pages or something else? We mostly hav specific types (e.g. vaddr_t) for such things. volatile is almost always wrong to use. Why do you want it here? If there is an actual coherency issue you will need proper barriers, not volatile. Ian.
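On the "<< SECOND_SHIFT" point: one possible shape for the helper is below. The macro names are invented, and they take xen_second slot indexes rather than the guest_vaddr argument suggested above, because that is what the patch's dirty.second_lvl_start/end fields currently hold.

#define VLPT_VA(second_idx)     ((vaddr_t)(second_idx) << SECOND_SHIFT)
#define VLPT_SIZE(start, end)   (((vaddr_t)(end) - (start)) << SECOND_SHIFT)

/* flush_vlpt() would then read:
 *   flush_xen_data_tlb_range_va(VLPT_VA(d->arch.dirty.second_lvl_start),
 *                               VLPT_SIZE(d->arch.dirty.second_lvl_start,
 *                                         d->arch.dirty.second_lvl_end));
 */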
Ian Campbell
2013-Nov-12 16:56 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:> Add hypercall (shadow op: enable/disable and clean/peek dirtied page bitmap). > It consists of two parts: dirty page detecting and saving. > For detecting, we setup the guest p2m''s leaf PTE read-only and whenever the guest > tries to write something, permission fault happens and traps into xen. > The permission-faulted GPA should be saved for the toolstack (when it wants to see > which pages are dirtied). For this purpose, we temporarily save the GPAs into bitmap. > > Changes from v4: > 1. For temporary saving dirty pages, use bitmap rather than linked list. > 2. Setup the p2m''s second level page as read-write in the view of xen''s memory access. > It happens in p2m_create_table function. > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > --- > xen/arch/arm/domain.c | 14 +++ > xen/arch/arm/domctl.c | 9 ++ > xen/arch/arm/mm.c | 103 +++++++++++++++++++- > xen/arch/arm/p2m.c | 206 ++++++++++++++++++++++++++++++++++++++++ > xen/arch/arm/traps.c | 9 ++ > xen/include/asm-arm/domain.h | 7 ++ > xen/include/asm-arm/mm.h | 7 ++ > xen/include/asm-arm/p2m.h | 4 + > xen/include/asm-arm/processor.h | 2 + > 9 files changed, 360 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c > index c0b5dd8..0a32301 100644 > --- a/xen/arch/arm/domain.c > +++ b/xen/arch/arm/domain.c > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > WRITE_SYSREG(hcr, HCR_EL2); > isb(); > > + /* for dirty-page tracing > + * XXX: how do we consider SMP case? > + */ > + if ( n->domain->arch.dirty.mode ) > + restore_vlpt(n->domain);This is an interesting question. xen_second is shared between all pcpus, which means that the vlpt is currently only usable from a single physical CPU at a time. Currently the only per-cpu area is the domheap region from 2G-4G. We could steal some address space from the top or bottom of there?> /* This is could trigger an hardware interrupt from the virtual > * timer. The interrupt needs to be injected into the guest. 
*/ > virt_timer_restore(n); > @@ -509,11 +515,19 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > /* Default the virtual ID to match the physical */ > d->arch.vpidr = boot_cpu_data.midr.bits; > > + /* init for dirty-page tracing */ > + d->arch.dirty.count = 0; > + d->arch.dirty.mode = 0; > + spin_lock_init(&d->arch.dirty.lock); > + > d->arch.dirty.second_lvl_start = 0; > d->arch.dirty.second_lvl_end = 0; > d->arch.dirty.second_lvl[0] = NULL; > d->arch.dirty.second_lvl[1] = NULL; > > + memset(d->arch.dirty.bitmap, 0, sizeof(d->arch.dirty.bitmap)); > + d->arch.dirty.bitmap_pages = 0; > + > clear_page(d->shared_info); > share_xen_page_with_guest( > virt_to_page(d->shared_info), d, XENSHARE_writable); > diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c > index cb38e59..eb74225 100644 > --- a/xen/arch/arm/domctl.c > +++ b/xen/arch/arm/domctl.c > @@ -93,6 +93,15 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d, > xfree(c.data); > } > break; > + case XEN_DOMCTL_shadow_op: > + { > + domain_pause(d); > + ret = dirty_mode_op(d, &domctl->u.shadow_op); > + domain_unpause(d); > + > + copyback = 1; > + } > + break; > > default: > return -EINVAL; > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c > index bf13993..d5a0a11 100644 > --- a/xen/arch/arm/mm.c > +++ b/xen/arch/arm/mm.c > @@ -845,7 +845,6 @@ void destroy_xen_mappings(unsigned long v, unsigned long e) > create_xen_entries(REMOVE, v, 0, (e - v) >> PAGE_SHIFT, 0); > } > > -enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; > static void set_pte_flags_on_range(const char *p, unsigned long l, enum mg mg) > { > lpae_t pte; > @@ -1320,6 +1319,60 @@ int is_iomem_page(unsigned long mfn) > * xen: arm: 64-bit guest support and domU FDT autogeneration > * will be upstreamed. > */ > + > +static inline void mark_dirty_bitmap(struct domain *d, paddr_t addr) > +{ > + paddr_t ram_base = (paddr_t) GUEST_RAM_BASE; > + int bit_index = PFN_DOWN(addr - ram_base); > + int page_index = bit_index >> (PAGE_SHIFT + 3);+3?> + int bit_index_residual = bit_index & ((1ul << (PAGE_SHIFT + 3)) - 1); > + > + set_bit(bit_index_residual, d->arch.dirty.bitmap[page_index]); > +} > + > +/* routine for dirty-page tracing > + * > + * On first write, it page faults, its entry is changed to read-write, > + * and on retry the write succeeds. > + * > + * for locating p2m of the faulting entry, we use virtual-linear page table. > + * returns zero if addr is not valid or dirty mode is not set > + */ > +int handle_page_fault(struct domain *d, paddr_t addr) > +{ > + > + lpae_t *vlp2m_pte = 0; > + paddr_t gma_start = 0; > + paddr_t gma_end = 0; > + > + if ( !d->arch.dirty.mode ) return 0; > + get_gma_start_end(d, &gma_start, &gma_end); > + > + /* Ensure that addr is inside guest''s RAM */ > + if ( addr < gma_start || > + addr > gma_end ) return 0; > + > + vlp2m_pte = get_vlpt_3lvl_pte(addr); > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 && > + vlp2m_pte->p2m.avail == 0 /* reuse avail bit as read-only */ ) > + { > + lpae_t pte = *vlp2m_pte; > + pte.p2m.write = 1; > + write_pte(vlp2m_pte, pte);Do we not need the p2m lock around here somewhere?> + flush_tlb_local();If SMP was working this would need to be inner shareable to cope with vcpus running on other pcpus.> + /* only necessary to lock between get-dirty bitmap and mark dirty > + * bitmap. 
If get-dirty bitmap happens immediately before this > + * lock, the corresponding dirty-page would be marked at the next > + * round of get-dirty bitmap */ > + spin_lock(&d->arch.dirty.lock); > + mark_dirty_bitmap(d, addr); > + spin_unlock(&d->arch.dirty.lock);Might an atomic set_bit suffice rather than the lock?> + } > + > + return 1; > +} > + > void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end) > { > if ( start ) > @@ -1440,6 +1493,54 @@ void cleanup_vlpt(struct domain *d) > unmap_domain_page_global(d->arch.dirty.second_lvl[0]); > unmap_domain_page_global(d->arch.dirty.second_lvl[1]); > } > + > +int prepare_bitmap(struct domain *d) > +{ > + paddr_t gma_start = 0; > + paddr_t gma_end = 0; > + int nr_bytes; > + int nr_pages; > + int i; > + > + get_gma_start_end(d, &gma_start, &gma_end); > + > + nr_bytes = (PFN_DOWN(gma_end - gma_start) + 7) / 8; > + nr_pages = (nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE; > + > + BUG_ON( nr_pages > MAX_DIRTY_BITMAP_PAGES ); > + > + for ( i = 0; i < nr_pages; ++i ) > + { > + struct page_info *page; > + page = alloc_domheap_page(NULL, 0); > + if ( page == NULL ) > + goto cleanup_on_failure; > + > + d->arch.dirty.bitmap[i] = map_domain_page_global(__page_to_mfn(page));I think you may as well just use the xenheap here.> +/* Change types across all p2m entries in a domain */ > +void p2m_change_entry_type_global(struct domain *d, enum mg nt) > +{ > + struct p2m_domain *p2m = &d->arch.p2m; > + paddr_t ram_base; > + int i1, i2, i3; > + int first_index, second_index, third_index; > + lpae_t *first = __map_domain_page(p2m->first_level); > + lpae_t pte, *second = NULL, *third = NULL; > + > + get_gma_start_end(d, &ram_base, NULL); > + > + first_index = first_table_offset((uint64_t)ram_base); > + second_index = second_table_offset((uint64_t)ram_base); > + third_index = third_table_offset((uint64_t)ram_base);Are those casts needed?> + > + BUG_ON( !first && "Can''t map first level p2m." ); > + > + spin_lock(&p2m->lock); > + > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )You avoid iterating over stuff below ram_base but you aren''t worried about ram end?> + { > + lpae_walk_t first_pte = first[i1].walk; > + if ( !first_pte.valid || !first_pte.table ) > + goto out; > + > + second = map_domain_page(first_pte.base); > + BUG_ON( !second && "Can''t map second level p2m."); > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > + { > + lpae_walk_t second_pte = second[i2].walk; > + > + if ( !second_pte.valid || !second_pte.table ) > + goto out;goto out is a bit strong. 
Don''t you want to move on to the next second level page?> + > + third = map_domain_page(second_pte.base); > + BUG_ON( !third && "Can''t map third level p2m."); > + > + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) > + { > + lpae_walk_t third_pte = third[i3].walk; > + if ( !third_pte.valid ) > + goto out; > + > + pte = third[i3]; > + if ( nt == mg_ro ) > + { > + if ( pte.p2m.write == 1 ) > + { > + pte.p2m.write = 0; > + pte.p2m.avail = 0; > + } > + else > + { > + /* reuse avail bit as an indicator > + * of ''actual'' read-only */ > + pte.p2m.avail = 1; > + }This inner if is equivalent to pte.p2m.avail = pte.p2m.write; pte.p2m.write = 1;> + } > + else if ( nt == mg_rw ) > + { > + if ( pte.p2m.write == 0 && pte.p2m.avail == 0 ) > + { > + pte.p2m.write = 1; > + }and this one is: pte.p2m.write = pte.p2m.avail;> + } > + write_pte(&third[i3], pte); > + } > + unmap_domain_page(third); > + > + third = NULL; > + third_index = 0; > + } > + unmap_domain_page(second); > + > + second = NULL; > + second_index = 0; > + third_index = 0; > + } > + > +out: > + flush_tlb_all_local(); > + if ( third ) unmap_domain_page(third); > + if ( second ) unmap_domain_page(second); > + if ( first ) unmap_domain_page(first); > + > + spin_unlock(&p2m->lock); > +} > + > +/* Read a domain''s log-dirty bitmap and stats. > + * If the operation is a CLEAN, clear the bitmap and stats. */ > +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) > +{ > + int peek = 1; > + int i; > + int bitmap_size; > + paddr_t gma_start, gma_end; > + > + /* this hypercall is called from domain 0, and we don''t know which guest''s > + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */ > + restore_vlpt(d); > + > + get_gma_start_end(d, &gma_start, &gma_end); > + bitmap_size = (gma_end - gma_start) / 8; > + > + if ( guest_handle_is_null(sc->dirty_bitmap) ) > + { > + peek = 0;What do you do with this?> + } > + else > + { > + spin_lock(&d->arch.dirty.lock); > + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) > + { > + int j = 0; > + uint8_t *bitmap; > + copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE, > + d->arch.dirty.bitmap[i], > + bitmap_size < PAGE_SIZE ? bitmap_size : > + PAGE_SIZE);The bitmap is always a mutliple of page_size, isn''t it?> + bitmap_size -= PAGE_SIZE; > + > + /* set p2m page table read-only */ > + bitmap = d->arch.dirty.bitmap[i]; > + while ((j = find_next_bit((const long unsigned int *)bitmap, > + PAGE_SIZE*8, j)) < PAGE_SIZE*8) > + { > + lpae_t *vlpt; > + paddr_t addr = gma_start + > + (i << (2*PAGE_SHIFT+3)) + > + (j << PAGE_SHIFT); > + vlpt = get_vlpt_3lvl_pte(addr); > + vlpt->p2m.write = 0;Needs to be a write_pte type construct I think.> + j++; > + } > + } > + > + if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN ) > + { > + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) > + { > + clear_page(d->arch.dirty.bitmap[i]); > + } > + } > + > + spin_unlock(&d->arch.dirty.lock);Ah, I see why the lock in the handler is needed. This holds the lock over quite a long period. 
Could it be made more granular by copying and clearing each page in sequence, dropping the lock between pages?> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > index 287dd7b..1a7ed11 100644 > --- a/xen/arch/arm/traps.c > +++ b/xen/arch/arm/traps.c > @@ -1321,6 +1321,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > const char *msg; > int rc, level = -1; > mmio_info_t info; > + int page_fault = ( (dabt.dfsc & FSC_MASK) => + (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );I think you can use FSC_TYPE_MASK and FSC_LL_MASK here, can''t you? I think this would be better off refactored into a dabt_is_page_fault(dabt), used in the test below. That would allow you to more easily comment on why these particular conditions are the ones we care about etc.> if ( !check_conditional_instr(regs, hsr) ) > { > @@ -1342,6 +1344,13 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > if ( rc == -EFAULT ) > goto bad_data_abort; > > + /* domU page fault handling for guest live migration */ > + /* dabt.valid can be 0 here */ > + if ( page_fault && handle_page_fault(current->domain, info.gpa) ) > + { > + /* Do not modify pc after page fault to repeat memory operation */ > + return; > + } > /* XXX: Decode the instruction if ISS is not valid */ > if ( !dabt.valid ) > goto bad_data_abort; > diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h > index 4f366f1..180d924 100644 > --- a/xen/include/asm-arm/domain.h > +++ b/xen/include/asm-arm/domain.h > @@ -114,9 +114,16 @@ struct arch_domain > > /* dirty-page tracing */ > struct { > +#define MAX_DIRTY_BITMAP_PAGES 64 /* support upto 8GB guest memory */ > + spinlock_t lock; /* protect list: head, mvn_head */ > + volatile int mode; /* 1 if dirty pages tracing enabled */ > + volatile unsigned int count; /* dirty pages counter */More unnecessary volatiles i think.> volatile int second_lvl_start; /* for context switch */ > volatile int second_lvl_end; > lpae_t *second_lvl[2]; /* copy of guest p2m''s first */ > + /* dirty bitmap */ > + uint8_t *bitmap[MAX_DIRTY_BITMAP_PAGES]; > + int bitmap_pages; /* number of dirty bitmap pages */ > } dirty; > > } __cacheline_aligned; > diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h > index a74e135..1ce7a4b 100644 > --- a/xen/include/asm-arm/mm.h > +++ b/xen/include/asm-arm/mm.h > @@ -341,11 +341,18 @@ static inline void put_page_and_type(struct page_info *page) > put_page(page); > } > > +enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; > + > +/* routine for dirty-page tracing */ > +int handle_page_fault(struct domain *d, paddr_t addr); > void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end); > int prepare_vlpt(struct domain *d); > void cleanup_vlpt(struct domain *d); > void restore_vlpt(struct domain *d); > > +int prepare_bitmap(struct domain *d); > +void cleanup_bitmap(struct domain *d); > + > /* calculate the xen''s virtual address for accessing the leaf PTE of > * a given address (GPA) */ > static inline lpae_t * get_vlpt_3lvl_pte(paddr_t addr) > diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h > index c660820..dba9a7b 100644 > --- a/xen/include/asm-arm/p2m.h > +++ b/xen/include/asm-arm/p2m.h > @@ -2,6 +2,7 @@ > #define _XEN_P2M_H > > #include <xen/mm.h> > +#include <public/domctl.h> > > struct domain; > > @@ -110,6 +111,9 @@ static inline int get_page_and_type(struct page_info *page, > return rc; > } > > +void p2m_change_entry_type_global(struct domain *d, enum mg nt); > +long dirty_mode_op(struct domain *d, 
xen_domctl_shadow_op_t *sc); > + > #endif /* _XEN_P2M_H */ > > /* > diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h > index 5294421..fced6ad 100644 > --- a/xen/include/asm-arm/processor.h > +++ b/xen/include/asm-arm/processor.h > @@ -399,6 +399,8 @@ union hsr { > #define FSC_CPR (0x3a) /* Coprocossor Abort */ > > #define FSC_LL_MASK (0x03<<0) > +#define FSC_MASK (0x3f) /* Fault status mask */ > +#define FSC_3D_LEVEL (0x03) /* Third level fault*/ > > /* Time counter hypervisor control register */ > #define CNTHCTL_PA (1u<<0) /* Kernel/user access to physical counter */
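To put the suggested refactoring into code: the condition the dirty tracer cares about (a guest write hitting a permission fault at the third lookup level) could be pulled out roughly as below. The helper name and its placement in traps.c are suggestions only, and the test could equally be written with FSC_TYPE_MASK/FSC_LL_MASK as noted above; the expression itself is the one from the patch.

static inline int dabt_is_page_fault(struct hsr_dabt dabt)
{
    /* Only a write that faulted on permissions at the 3rd level is a
     * candidate for dirty-page tracking. */
    return dabt.write &&
           ((dabt.dfsc & FSC_MASK) == (FSC_FLT_PERM | FSC_3D_LEVEL));
}

do_trap_data_abort_guest() would then test dabt_is_page_fault(dabt) instead of open-coding the mask, which also gives a natural place to comment on why exactly these faults are handled.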
Ian Campbell
2013-Nov-12 17:22 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:> From: Alexey Sokolov <sokolov.a@samsung.com> > > Implement for xl restore/save (which are also used for migrate) operation in xc_arm_migrate.c and make it compilable. > The overall process of save is the following: > 1) save guest parameters (i.e., memory map, console and store pfn, etc) > 2) save memory (if it is live, perform dirty-page tracing) > 3) save hvm states (i.e., gic, timer, vcpu etc) > > Singed-off-by: Alexey Sokolov <sokolov.a@samsung.com> > --- > config/arm32.mk | 1 + > tools/libxc/Makefile | 6 +- > tools/libxc/xc_arm_migrate.c | 712 +++++++++++++++++++++++++++++++++++++++++++ > tools/libxc/xc_dom_arm.c | 4 +- > tools/misc/Makefile | 4 +- > 5 files changed, 723 insertions(+), 4 deletions(-) > create mode 100644 tools/libxc/xc_arm_migrate.c > > diff --git a/config/arm32.mk b/config/arm32.mk > index aa79d22..01374c9 100644 > --- a/config/arm32.mk > +++ b/config/arm32.mk > @@ -1,6 +1,7 @@ > CONFIG_ARM := y > CONFIG_ARM_32 := y > CONFIG_ARM_$(XEN_OS) := y > +CONFIG_MIGRATE := y > > CONFIG_XEN_INSTALL_SUFFIX :> > diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile > index 4c64c15..05dfef4 100644 > --- a/tools/libxc/Makefile > +++ b/tools/libxc/Makefile > @@ -42,8 +42,13 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c > GUEST_SRCS-y :> GUEST_SRCS-y += xg_private.c xc_suspend.c > ifeq ($(CONFIG_MIGRATE),y) > +ifeq ($(CONFIG_X86),y) > GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c > GUEST_SRCS-y += xc_offline_page.c xc_compression.c > +endif > +ifeq ($(CONFIG_ARM),y) > +GUEST_SRCS-y += xc_arm_migrate.cI know you are just following the example above but I think this can be GUEST_SRCS-$(CONFIG_ARM) += xc_arm...> +endif > else > GUEST_SRCS-y += xc_nomigrate.c > endif > @@ -63,7 +68,6 @@ $(patsubst %.c,%.opic,$(ELF_SRCS-y)): CFLAGS += -Wno-pointer-sign > GUEST_SRCS-y += xc_dom_core.c xc_dom_boot.c > GUEST_SRCS-y += xc_dom_elfloader.c > GUEST_SRCS-$(CONFIG_X86) += xc_dom_bzimageloader.c > -GUEST_SRCS-$(CONFIG_X86) += xc_dom_decompress_lz4.cI don''t think this was intentional, was it?> GUEST_SRCS-$(CONFIG_ARM) += xc_dom_armzimageloader.c > GUEST_SRCS-y += xc_dom_binloader.c > GUEST_SRCS-y += xc_dom_compat_linux.c > diff --git a/tools/libxc/xc_arm_migrate.c b/tools/libxc/xc_arm_migrate.c > new file mode 100644 > index 0000000..461e339 > --- /dev/null > +++ b/tools/libxc/xc_arm_migrate.c > @@ -0,0 +1,712 @@Is this implementing the exact protocol as described in tools/libxc/xg_save_restore.h or is it a variant? Are there any docs of the specifics of the ARM protocol? We will eventually need to make a statement about the stability of the protocol, i.e on x86 we support X->X+1 migrations across Xen versions. I think we''d need to make similar guarantees on ARM before we would remove the "tech preview" label from the migration feature. So the docs are useful so we can review the intended protocol for forward compatibility problems etc. We needn''t necessarily implement the x86 one from xg_save_restore.h. In particular it would be nice if the protocol and each of the "chunks" in it were explicitly versioned etc. 
For example the code assumes that the HVM context implicitly follows the last iteration -- this caused untold pain on x86 when remus was added...> +/****************************************************************************** > + * This library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; > + * version 2.1 of the License. > + * > + * This library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this library; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + * > + * Copyright (c) 2013, Samsung Electronics > + */ > + > +#include <inttypes.h> > +#include <errno.h> > +#include <xenctrl.h> > +#include <xenguest.h> > + > +#include <unistd.h> > +#include <xc_private.h> > +#include <xc_dom.h> > +#include "xc_bitops.h" > +#include "xg_private.h" > + > +/* Guest RAM base */ > +#define GUEST_RAM_BASE 0x80000000 > +/* > + * XXX: Use correct definition for RAM base when the following patch > + * xen: arm: 64-bit guest support and domU FDT autogeneration > + * will be upstreamed. > + */ > + > +#define DEF_MAX_ITERS 29 /* limit us to 30 times round loop */ > +#define DEF_MAX_FACTOR 3 /* never send more than 3x p2m_size */ > +#define DEF_MIN_DIRTY_PER_ITER 50 /* dirty page count to define last iter */ > +#define DEF_PROGRESS_RATE 50 /* progress bar update rate */ > + > +/* Enable this macro for debug only: "static" migration instead of live */ > +/* > +#define DISABLE_LIVE_MIGRATION > +*/I don''t think this is needed, the caller can be hacked if necessary.> + > +/* Enable this macro for debug only: additional debug info */ > +/* > +#define ARM_MIGRATE_VERBOSE > +*/Likewise.> +/* ============ Memory ============= */ > +static int save_memory(xc_interface *xch, int io_fd, uint32_t dom, > + struct save_callbacks *callbacks, > + uint32_t max_iters, uint32_t max_factor, > + guest_params_t *params) > +{ > + int live = !!(params->flags & XCFLAGS_LIVE); > + int debug = !!(params->flags & XCFLAGS_DEBUG); > + xen_pfn_t i; > + char reportbuf[80]; > + int iter = 0; > + int last_iter = !live; > + int total_dirty_pages_num = 0; > + int dirty_pages_on_prev_iter_num = 0; > + int count = 0; > + char *page = 0; > + xen_pfn_t *busy_pages = 0; > + int busy_pages_count = 0; > + int busy_pages_max = 256; > + > + DECLARE_HYPERCALL_BUFFER(unsigned long, to_send); > + > + xen_pfn_t start = params->start_gpfn; > + const xen_pfn_t end = params->max_gpfn; > + const xen_pfn_t mem_size = end - start; > + > + if ( debug ) > + { > + IPRINTF("(save mem) start=%llx end=%llx!\n", start, end); > + }FYI you don''t need the {}''s for cases like this. is if ( debug ) IPRINTF(...) not the equivalent of DPRINTF?> + > + if ( live ) > + { > + if ( xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, > + NULL, 0, NULL, 0, NULL) < 0 ) > + { > + ERROR("Couldn''t enable log-dirty mode !\n"); > + return -1; > + } > + > + max_iters = max_iters ? : DEF_MAX_ITERS; > + max_factor = max_factor ? 
: DEF_MAX_FACTOR; > + > + if ( debug ) > + IPRINTF("Log-dirty mode enabled, max_iters=%d, max_factor=%d!\n", > + max_iters, max_factor); > + } > + > + to_send = xc_hypercall_buffer_alloc_pages(xch, to_send, > + NRPAGES(bitmap_size(mem_size))); > + if ( !to_send ) > + { > + ERROR("Couldn''t allocate to_send array!\n"); > + return -1; > + } > + > + /* send all pages on first iter */ > + memset(to_send, 0xff, bitmap_size(mem_size)); > + > + for ( ; ; ) > + { > + int dirty_pages_on_current_iter_num = 0; > + int frc; > + iter++; > + > + snprintf(reportbuf, sizeof(reportbuf), > + "Saving memory: iter %d (last sent %u)", > + iter, dirty_pages_on_prev_iter_num); > + > + xc_report_progress_start(xch, reportbuf, mem_size); > + > + if ( (iter > 1 && > + dirty_pages_on_prev_iter_num < DEF_MIN_DIRTY_PER_ITER) || > + (iter == max_iters) || > + (total_dirty_pages_num >= mem_size*max_factor) ) > + { > + if ( debug ) > + IPRINTF("Last iteration"); > + last_iter = 1; > + } > + > + if ( last_iter ) > + { > + if ( suspend_and_state(callbacks->suspend, callbacks->data, > + xch, dom) ) > + { > + ERROR("Domain appears not to have suspended"); > + return -1; > + } > + } > + if ( live && iter > 1 ) > + { > + frc = xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN, > + HYPERCALL_BUFFER(to_send), mem_size, > + NULL, 0, NULL); > + if ( frc != mem_size ) > + { > + ERROR("Error peeking shadow bitmap"); > + xc_hypercall_buffer_free_pages(xch, to_send, > + NRPAGES(bitmap_size(mem_size))); > + return -1; > + } > + } > + > + busy_pages = malloc(sizeof(xen_pfn_t) * busy_pages_max); > + > + for ( i = start; i < end; ++i ) > + { > + if ( test_bit(i - start, to_send) ) > + { > + page = xc_map_foreign_range(xch, dom, PAGE_SIZE, PROT_READ, i);On x86 we try to do this in batches to reduce the overheads. I suppose that could be a future enhancement.> > + if ( !page ) > + { > + /* This page is mapped elsewhere, should be resent later */What does this ("busy") mean? When does this happen? [...]> + > +static int restore_guest_params(xc_interface *xch, int io_fd, > + uint32_t dom, guest_params_t *params) > +{ > [...]> + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) > + { > + ERROR("Can''t set memory map"); > + return -1; > + } > + > + /* Set max. number of vcpus as max_vcpu_id + 1 */ > + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) )Does the higher level toolstack not take care of vcpus and maxmem? I thought so. 
I think this is how it shoud be.> + { > + ERROR("Can''t set max vcpu number for domain"); > + return -1; > + } > + > + return 0; > +} > +[...]> diff --git a/tools/misc/Makefile b/tools/misc/Makefile > index 17aeda5..0824100 100644 > --- a/tools/misc/Makefile > +++ b/tools/misc/Makefile > @@ -11,7 +11,7 @@ HDRS = $(wildcard *.h) > > TARGETS-y := xenperf xenpm xen-tmem-list-parse gtraceview gtracestat xenlockprof xenwatchdogd xencov > TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump > -TARGETS-$(CONFIG_MIGRATE) += xen-hptool > +TARGETS-$(CONFIG_X86) += xen-hptool > TARGETS := $(TARGETS-y) > > SUBDIRS := $(SUBDIRS-y) > @@ -23,7 +23,7 @@ INSTALL_BIN := $(INSTALL_BIN-y) > INSTALL_SBIN-y := xen-bugtool xen-python-path xenperf xenpm xen-tmem-list-parse gtraceview \ > gtracestat xenlockprof xenwatchdogd xen-ringwatch xencov > INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump > -INSTALL_SBIN-$(CONFIG_MIGRATE) += xen-hptool > +INSTALL_SBIN-$(CONFIG_X86) += xen-hptool > INSTALL_SBIN := $(INSTALL_SBIN-y) > > INSTALL_PRIVBIN-y := xenpvnetbootYou could resend these last two separately and they could probably go straight in. Ian.
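As an illustration of the batching Ian mentions above, the per-page xc_map_foreign_range() call could be replaced by something along the following lines. This is only a sketch: the helper name and batch size are invented, error handling is minimal, and it assumes the usual xc_arm_migrate.c includes (xc_private.h provides write_exact() and PAGE_SIZE); xc_map_foreign_pages() is the existing libxc interface that maps an array of pfns in one call.

    /* Hypothetical helper: map up to 'count' dirty pfns with a single libxc
     * call and stream them out, instead of one mapping per page. */
    #define SAVE_BATCH_SIZE 256

    static int send_dirty_batch(xc_interface *xch, int io_fd, uint32_t dom,
                                const xen_pfn_t *pfns, int count)
    {
        int i, rc = -1;
        char *pages = xc_map_foreign_pages(xch, dom, PROT_READ, pfns, count);

        if ( pages == NULL )
            return -1;

        for ( i = 0; i < count; i++ )
        {
            /* Each page of the batch sits at a fixed offset in the mapping. */
            if ( write_exact(io_fd, pages + i * PAGE_SIZE, PAGE_SIZE) )
                goto out;
        }
        rc = 0;
    out:
        munmap(pages, count * PAGE_SIZE);
        return rc;
    }

The caller would accumulate pfns from the to_send bitmap into a SAVE_BATCH_SIZE array and flush it whenever it fills up or the scan of the bitmap finishes.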
Eugene Fedotov
2013-Nov-13 08:00 UTC
Re: [PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
12.11.2013 19:15, Ian Campbell wrote:> On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: > > If these get resent again please can you CC the ARM maintainers. > $ git grep -A5 ARM MAINTAINERS > MAINTAINERS:ARM (W/ VIRTUALISATION EXTENSIONS) ARCHITECTURE > MAINTAINERS-M: Ian Campbell <ian.campbell@citrix.com> > MAINTAINERS-M: Stefano Stabellini <stefano.stabellini@citrix.com> > MAINTAINERS-M: Tim Deegan <tim@xen.org> > MAINTAINERS-S: Supported > MAINTAINERS-L: xen-devel@lists.xen.org > >> From: Evgeny Fedotov <e.fedotov@samsung.com> >> >> Implement save/restore of hvm context hypercall. >> In hvm context save/restore, we save gic, timer and vfp registers. >> >> Changes from v4: Save vcpu registers within hvm context, and purge >> the save-vcpu-register patch. > The inter-version changelog should go after the main commit message and > the S-o-b separated by a "---" on a line by itself. This way it is not > included in the final commit.OK.> [...] >> +static int vgic_irq_rank_save(struct vgic_rank *ext, >> + struct vgic_irq_rank *rank) >> +{ >> + spin_lock(&rank->lock); > This is probably wise, but the domain must be paused at this point, > right?I think this lock can be removed, thanks.> >> + /* Some of VGIC registers are not used yet, it is for a future usage */ >> + /* IENABLE, IACTIVE, IPEND, PENDSGI registers */ >> + ext->ienable = rank->ienable; >> + ext->iactive = rank->iactive; >> + ext->ipend = rank->ipend; >> + ext->pendsgi = rank->pendsgi; >> + /* ICFG */ >> + ext->icfg[0] = rank->icfg[0]; >> + ext->icfg[1] = rank->icfg[1]; >> + /* IPRIORITY */ >> + if ( sizeof(rank->ipriority) != sizeof (ext->ipriority) ) >> + { >> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check ipriority dumping space\n"); > What makes ipriority (and itargets below) special enough to warrant such > a check compared with all the other fields? > > Could these checks be BUILD_BUG_ON?If we use memcpy I think we should always check if the size of source and destination is equal. Of course, BUILD_BUG_ON is safe so I can use it. Thanks.> >> + return -EINVAL; >> + } >> + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); >> + /* ITARGETS */ >> + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) >> + { >> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); >> + return -EINVAL; >> + } >> + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); >> + spin_unlock(&rank->lock); >> + return 0; >> +} >> + >> +static int gic_save(struct domain *d, hvm_domain_context_t *h) >> +{ >> + struct hvm_hw_gic ctxt; >> + struct vcpu *v; >> + >> + /* Save the state of GICs */ >> + for_each_vcpu( d, v ) > Where is the GICD state saved then?The only GICD structure we save for guest domain is struct vgic_irq_rank, it includes: IENABLE, IACTIVE, IPEND, PENDSGI, ICFG, IPRIORITY, ITARGETS registers. We create the same structure inside hvm : vgic_rank (that is no guaranteed to be the same as struct vgic_irq_rank) and save it calling vgic_irq_rank_save routine below in gic_save.> >> + { >> + ctxt.gic_hcr = v->arch.gic_hcr; >> + ctxt.gic_vmcr = v->arch.gic_vmcr; >> + ctxt.gic_apr = v->arch.gic_apr; >> + >> + /* Save list registers and masks */ >> + /* (it is not necessary to save/restore them, but LR state can have >> + * influence on downtime after Live Migration (to be tested) > What does this mean? > > I''m in two minds about whether LR counts as architectural state, since > it is hypervisor side and not guest facing. 
I''d have thought that the LR > content could be reconstructed from the pending and active bitmasks.Well, I tried to save LR to restore all interrupt queues (pending and active) in future to minimize the quantity of lost IRQs during the migration. But in this stage it is quite wrong to save LR so I remove this.> >> + */ >> + if ( sizeof(v->arch.gic_lr) > sizeof (ctxt.gic_lr) ) >> + { >> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: increase LR dumping space\n"); >> + return -EINVAL; >> + } >> + memcpy(ctxt.gic_lr, v->arch.gic_lr, sizeof(v->arch.gic_lr)); >> + ctxt.lr_mask = v->arch.lr_mask; >> + ctxt.event_mask = v->arch.event_mask; > These last two in particular make me think that saving the LRs is wrong.OK> >> + /* Save PPI states (per-CPU) */ >> + /* It is necessary if SMP enabled */ >> + if ( vgic_irq_rank_save(&ctxt.ppi_state, &v->arch.vgic.private_irqs) ) >> + return 1; >> + >> + if ( hvm_save_entry(GIC, v->vcpu_id, h, &ctxt) != 0 ) >> + return 1; >> + } >> + return 0; >> +} >> + >> +static int gic_load(struct domain *d, hvm_domain_context_t *h) >> +{ >> + int vcpuid; >> + struct hvm_hw_gic ctxt; >> + struct vcpu *v; >> [...] >> + return 0; >> +} >> + >> +HVM_REGISTER_SAVE_RESTORE(GIC, gic_save, gic_load, 1, HVMSR_PER_VCPU); >> + >> [...] >> +#ifdef CONFIG_ARM_32 >> + ctxt.mair0 = v->arch.mair0; >> + ctxt.mair1 = v->arch.mair1; >> +#else >> + ctxt.mair0 = v->arch.mair; >> +#endif >> + /* Control Registers */ >> + ctxt.actlr = v->arch.actlr; >> + ctxt.sctlr = v->arch.sctlr; >> + ctxt.cpacr = v->arch.cpacr; >> [...] >> + ctxt.x0 = c.x0; > Would it make sense to include a field of type vcpu_guest_core_regs in > the save struct instead of this big list?Yes, I think vcpu_guest_core_regs is already common definition for ARM32 and ARM64. [...]>> + #ifdef CONFIG_ARM_32 > #ifdef in the first columns please.OK> >> + ctxt.spsr_fiq = c.spsr_fiq; >> + ctxt.spsr_irq = c.spsr_irq; >> + ctxt.spsr_und = c.spsr_und; >> + ctxt.spsr_abt = c.spsr_abt; > and don''t indent these as if they were inside a blockSure> > You had this right when you saved the MAIR registers above. > >> + #endif >> + #ifdef CONFIG_ARM_64 > or #else if you prefer.OK> >> + ctxt.sp_el0 = c.sp_el0; >> + ctxt.sp_el1 = c.sp_el1; >> + ctxt.elr_el1 = c.elr_el1; >> + #endif >> + >> + /* check VFP state size before dumping */ >> + if ( sizeof(v->arch.vfp) > sizeof (ctxt.vfp) ) >> + { >> + dprintk(XENLOG_G_ERR, "hvm_hw_cpu: increase VFP dumping space\n"); >> + return -EINVAL; >> + } >> + memcpy((void*) &ctxt.vfp, (void*) &v->arch.vfp, sizeof(v->arch.vfp)); >> + >> + ctxt.pause_flags = v->pause_flags; > x86 has at the top of the loop: > /* We don''t need to save state for a vcpu that is down; the restore > * code will leave it down if there is nothing saved. */ > if ( test_bit(_VPF_down, &v->pause_flags) ) > continue; > > which I think is preferable to this.OK> >> diff --git a/xen/include/asm-arm/hvm/support.h b/xen/include/asm-arm/hvm/support.h >> new file mode 100644 >> index 0000000..8311f2f >> --- /dev/null >> +++ b/xen/include/asm-arm/hvm/support.h >> @@ -0,0 +1,29 @@ >> +/* >> + * support.h: HVM support routines used by ARMv7 VE. > Not just on Vexpress nor just on v7 I expect?VE means Virtualization Extentions here. 
OK, we should support ARM v8 in future,so can be "HVM support routines used by ARMv7/v8 with Virtualization Extensions" a good comment?>> + * >> + * Copyright (c) 2012, Citrix Systems > Really?OK>> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms and conditions of the GNU General Public License, >> + * version 2, as published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + * >> + * You should have received a copy of the GNU General Public License along with >> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple >> + * Place - Suite 330, Boston, MA 02111-1307 USA. >> + */ >> + >> +#ifndef __ASM_ARM_HVM_SUPPORT_H__ >> +#define __ASM_ARM_HVM_SUPPORT_H__ >> + >> +#include <xen/types.h> >> +#include <public/hvm/ioreq.h> >> +#include <xen/sched.h> >> +#include <xen/hvm/save.h> >> +#include <asm/processor.h> > Just some #includes here?Yes, I remove unnecessary includes.> > Ian, > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel >
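Picking up the BUILD_BUG_ON suggestion above, a minimal sketch of the save routine with the run-time sizeof checks turned into compile-time ones (field names follow the quoted patch; the lock is kept as belt-and-braces):

    static int vgic_irq_rank_save(struct vgic_rank *ext,
                                  struct vgic_irq_rank *rank)
    {
        /* A layout mismatch between the live and saved structures is a build
         * bug, not something to discover while saving a domain. */
        BUILD_BUG_ON( sizeof(rank->ipriority) != sizeof(ext->ipriority) );
        BUILD_BUG_ON( sizeof(rank->itargets)  != sizeof(ext->itargets) );

        spin_lock(&rank->lock);
        ext->ienable = rank->ienable;
        ext->iactive = rank->iactive;
        ext->ipend   = rank->ipend;
        ext->pendsgi = rank->pendsgi;
        ext->icfg[0] = rank->icfg[0];
        ext->icfg[1] = rank->icfg[1];
        memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority));
        memcpy(ext->itargets,  rank->itargets,  sizeof(rank->itargets));
        spin_unlock(&rank->lock);

        return 0;
    }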
Eugene Fedotov
2013-Nov-13 08:28 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
12.11.2013 19:21, Ian Campbell wrote:
> On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:
>> From: Evgeny Fedotov <e.fedotov@samsung.com>
>>
>> By using the memory map info in arch_domain (from set_memory_map hypercall)
>> implement get_maximum_gpfn hypercall.
>>
>> Changes from v4: Use GUEST_RAM_BASE as the start physical address of guest
>> RAM. And, purge set-memory-map patch
>>
>> Singed-off-by: Evgeny Fedotov <e.fedotov@samsung.com>
>> ---
>>   xen/arch/arm/mm.c        | 22 +++++++++++++++++++++-
>>   xen/include/asm-arm/mm.h |  2 ++
>>   2 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>> index 123280e..3801f07 100644
>> --- a/xen/arch/arm/mm.c
>> +++ b/xen/arch/arm/mm.c
>> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>
>>   unsigned long domain_get_maximum_gpfn(struct domain *d)
> s/unsigned long/xen_pfn_t/ I think.
>
> Urk, which will break the ABI for this hypercall. That's something of a
> conundrum :-/
>
> It is a domctl so we are at liberty to change it, probably to taking a
> xen_pfn_t * to fill in instead of returning the value.
>
> I'm in two minds about whether we should do so now or postpone it (which
> risks forgetting and having an obscure bug somewhere down the line).
OK. Either that, or another solution is to drop this hypercall from the
migration code entirely and use the max_memkb field from DOMCTL_getdomaininfo
to obtain the maximum PFN.
>
>>   {
>> -    return -ENOSYS;
>> +    paddr_t end;
>> +
>> +    get_gma_start_end(d, NULL, &end);
> I suppose this function is useful in a future patch too, otherwise
>      return (GUEST_RAM_BASE>>PAGE_SHIFT)+d->max_pages
> would suffice.
OK
> Ian.
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
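For reference, the simpler form Ian suggests amounts to the sketch below, assuming guest RAM is one contiguous block starting at GUEST_RAM_BASE; the xen_pfn_t-vs-unsigned-long concern raised above applies to the return type either way.

    unsigned long domain_get_maximum_gpfn(struct domain *d)
    {
        /* Guest RAM is assumed to be a single contiguous block of
         * d->max_pages pages starting at GUEST_RAM_BASE. */
        return (GUEST_RAM_BASE >> PAGE_SHIFT) + d->max_pages;
    }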
Eugene Fedotov
2013-Nov-13 08:40 UTC
Re: [PATCH RESEND v5 3/6] xen/arm: Implement modify_returncode
12.11.2013 19:24, Ian Campbell wrote:
> On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:
>> From: Alexey Sokolov <sokolov.a@samsung.com>
>>
>> Making sched_op in do_suspend (driver/xen/manage.c) returns 0 on the success of suspend.
>>
>> Singed-off-by: Alexey Sokolov <sokolov.a@samsung.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>
> Although if you wanted to...
>> +#elif defined(__arm__)
> make this "... || defined(__aarch64__)" and ...
>> +        ctxt.c.user_regs.r0_usr = 1;
> Use x0 here instead (which is a 64-bit alias of r0_usr, so good for both
> 32 and 64 bit).
>
> Then it would be ready made for 64-bit too in the future.
>
> Ian.
Good point, OK.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
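With both of Ian's suggestions applied, the ARM side of modify_returncode() in xc_resume.c would look roughly like the sketch below; the x86 branch and error reporting are elided, so treat it as an outline rather than the final patch.

    static int modify_returncode(xc_interface *xch, uint32_t domid)
    {
        vcpu_guest_context_any_t ctxt;

        if ( xc_vcpu_getcontext(xch, domid, 0, &ctxt) != 0 )
            return -1;

    #if defined(__i386__) || defined(__x86_64__)
        /* Existing x86 PV/HVM handling. */
    #elif defined(__arm__) || defined(__aarch64__)
        /* x0 is the 64-bit alias of r0_usr, so this covers both 32- and
         * 64-bit guests; the value becomes the guest-visible return code
         * of the suspend hypercall. */
        ctxt.c.user_regs.x0 = 1;
    #endif

        if ( xc_vcpu_setcontext(xch, domid, 0, &ctxt) != 0 )
            return -1;

        return 0;
    }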
Eugene Fedotov
2013-Nov-13 09:57 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
12.11.2013 21:22, Ian Campbell wrote:> On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: >> From: Alexey Sokolov <sokolov.a@samsung.com> >> >> Implement for xl restore/save (which are also used for migrate) operation in xc_arm_migrate.c and make it compilable. >> The overall process of save is the following: >> 1) save guest parameters (i.e., memory map, console and store pfn, etc) >> 2) save memory (if it is live, perform dirty-page tracing) >> 3) save hvm states (i.e., gic, timer, vcpu etc) >> >> Singed-off-by: Alexey Sokolov <sokolov.a@samsung.com> >> --- >> config/arm32.mk | 1 + >> tools/libxc/Makefile | 6 +- >> tools/libxc/xc_arm_migrate.c | 712 +++++++++++++++++++++++++++++++++++++++++++ >> tools/libxc/xc_dom_arm.c | 4 +- >> tools/misc/Makefile | 4 +- >> 5 files changed, 723 insertions(+), 4 deletions(-) >> create mode 100644 tools/libxc/xc_arm_migrate.c >> >> diff --git a/config/arm32.mk b/config/arm32.mk >> index aa79d22..01374c9 100644 >> --- a/config/arm32.mk >> +++ b/config/arm32.mk >> @@ -1,6 +1,7 @@ >> CONFIG_ARM := y >> CONFIG_ARM_32 := y >> CONFIG_ARM_$(XEN_OS) := y >> +CONFIG_MIGRATE := y >> >> CONFIG_XEN_INSTALL_SUFFIX :>> >> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile >> index 4c64c15..05dfef4 100644 >> --- a/tools/libxc/Makefile >> +++ b/tools/libxc/Makefile >> @@ -42,8 +42,13 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c >> GUEST_SRCS-y :>> GUEST_SRCS-y += xg_private.c xc_suspend.c >> ifeq ($(CONFIG_MIGRATE),y) >> +ifeq ($(CONFIG_X86),y) >> GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c >> GUEST_SRCS-y += xc_offline_page.c xc_compression.c >> +endif >> +ifeq ($(CONFIG_ARM),y) >> +GUEST_SRCS-y += xc_arm_migrate.c > I know you are just following the example above but I think this can be > GUEST_SRCS-$(CONFIG_ARM) += xc_arm...OK> >> +endif >> else >> GUEST_SRCS-y += xc_nomigrate.c >> endif >> @@ -63,7 +68,6 @@ $(patsubst %.c,%.opic,$(ELF_SRCS-y)): CFLAGS += -Wno-pointer-sign >> GUEST_SRCS-y += xc_dom_core.c xc_dom_boot.c >> GUEST_SRCS-y += xc_dom_elfloader.c >> GUEST_SRCS-$(CONFIG_X86) += xc_dom_bzimageloader.c >> -GUEST_SRCS-$(CONFIG_X86) += xc_dom_decompress_lz4.c > I don''t think this was intentional, was it?Oops, sure.>> GUEST_SRCS-$(CONFIG_ARM) += xc_dom_armzimageloader.c >> GUEST_SRCS-y += xc_dom_binloader.c >> GUEST_SRCS-y += xc_dom_compat_linux.c >> diff --git a/tools/libxc/xc_arm_migrate.c b/tools/libxc/xc_arm_migrate.c >> new file mode 100644 >> index 0000000..461e339 >> --- /dev/null >> +++ b/tools/libxc/xc_arm_migrate.c >> @@ -0,0 +1,712 @@ > Is this implementing the exact protocol as described in > tools/libxc/xg_save_restore.h or is it a variant? Are there any docs of > the specifics of the ARM protocol?This implements a quite different from tools/libxc/xg_save_restore.h protocol, it is much more simplified because we do not need some things that implemented for x86. So you''re right, it has to be documented. Should we use a different header to place documentation to this (and place some definitions), for example xc_arm_migrate.h?> We will eventually need to make a statement about the stability of the > protocol, i.e on x86 we support X->X+1 migrations across Xen versions. I > think we''d need to make similar guarantees on ARM before we would remove > the "tech preview" label from the migration feature.So, should you believe our results (and where should we place this statement) or should you make tests from your side?> So the docs are useful so we can review the intended protocol for > forward compatibility problems etc. 
We needn''t necessarily implement the > x86 one from xg_save_restore.h. > > In particular it would be nice if the protocol and each of the "chunks" > in it were explicitly versioned etc. For example the code assumes that > the HVM context implicitly follows the last iteration -- this caused > untold pain on x86 when remus was added...OK, such documentation will be soon.> >> +/****************************************************************************** >> + * This library is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU Lesser General Public >> + * License as published by the Free Software Foundation; >> + * version 2.1 of the License. >> + * >> + * This library is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> + * Lesser General Public License for more details. >> + * >> + * You should have received a copy of the GNU Lesser General Public >> + * License along with this library; if not, write to the Free Software >> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA >> + * >> + * Copyright (c) 2013, Samsung Electronics >> + */ >> + >> +#include <inttypes.h> >> +#include <errno.h> >> +#include <xenctrl.h> >> +#include <xenguest.h> >> + >> +#include <unistd.h> >> +#include <xc_private.h> >> +#include <xc_dom.h> >> +#include "xc_bitops.h" >> +#include "xg_private.h" >> + >> +/* Guest RAM base */ >> +#define GUEST_RAM_BASE 0x80000000 >> +/* >> + * XXX: Use correct definition for RAM base when the following patch >> + * xen: arm: 64-bit guest support and domU FDT autogeneration >> + * will be upstreamed. >> + */ >> + >> +#define DEF_MAX_ITERS 29 /* limit us to 30 times round loop */ >> +#define DEF_MAX_FACTOR 3 /* never send more than 3x p2m_size */ >> +#define DEF_MIN_DIRTY_PER_ITER 50 /* dirty page count to define last iter */ >> +#define DEF_PROGRESS_RATE 50 /* progress bar update rate */ >> + >> +/* Enable this macro for debug only: "static" migration instead of live */ >> +/* >> +#define DISABLE_LIVE_MIGRATION >> +*/ > I don''t think this is needed, the caller can be hacked if necessary.OK, I remove. It is internal hack.> >> + >> +/* Enable this macro for debug only: additional debug info */ >> +/* >> +#define ARM_MIGRATE_VERBOSE >> +*/ > Likewise.OK> >> +/* ============ Memory ============= */ >> +static int save_memory(xc_interface *xch, int io_fd, uint32_t dom, >> + struct save_callbacks *callbacks, >> + uint32_t max_iters, uint32_t max_factor, >> + guest_params_t *params) >> +{ >> + int live = !!(params->flags & XCFLAGS_LIVE); >> + int debug = !!(params->flags & XCFLAGS_DEBUG); >> + xen_pfn_t i; >> + char reportbuf[80]; >> + int iter = 0; >> + int last_iter = !live; >> + int total_dirty_pages_num = 0; >> + int dirty_pages_on_prev_iter_num = 0; >> + int count = 0; >> + char *page = 0; >> + xen_pfn_t *busy_pages = 0; >> + int busy_pages_count = 0; >> + int busy_pages_max = 256; >> + >> + DECLARE_HYPERCALL_BUFFER(unsigned long, to_send); >> + >> + xen_pfn_t start = params->start_gpfn; >> + const xen_pfn_t end = params->max_gpfn; >> + const xen_pfn_t mem_size = end - start; >> + >> + if ( debug ) >> + { >> + IPRINTF("(save mem) start=%llx end=%llx!\n", start, end); >> + } > FYI you don''t need the {}''s for cases like this.Actually we don''t need {}, this has been done because we was not sure if this macro can be empty-substituted.> > is if ( debug ) IPRINTF(...) 
not the equivalent of DPRINTF?This equivalence is not obvious for me, because in current code we obtain debug flag with XCFLAGS_DEBUG mask (when --debug option passed). If it is equivalent I''ll use DPRINTF.> >> + >> + if ( live ) >> + { >> + if ( xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, >> + NULL, 0, NULL, 0, NULL) < 0 ) >> + { >> + ERROR("Couldn''t enable log-dirty mode !\n"); >> + return -1; >> + } >> + >> + max_iters = max_iters ? : DEF_MAX_ITERS; >> + max_factor = max_factor ? : DEF_MAX_FACTOR; >> + >> + if ( debug ) >> + IPRINTF("Log-dirty mode enabled, max_iters=%d, max_factor=%d!\n", >> + max_iters, max_factor); >> + } >> + >> + to_send = xc_hypercall_buffer_alloc_pages(xch, to_send, >> + NRPAGES(bitmap_size(mem_size))); >> + if ( !to_send ) >> + { >> + ERROR("Couldn''t allocate to_send array!\n"); >> + return -1; >> + } >> + >> + /* send all pages on first iter */ >> + memset(to_send, 0xff, bitmap_size(mem_size)); >> + >> + for ( ; ; ) >> + { >> + int dirty_pages_on_current_iter_num = 0; >> + int frc; >> + iter++; >> + >> + snprintf(reportbuf, sizeof(reportbuf), >> + "Saving memory: iter %d (last sent %u)", >> + iter, dirty_pages_on_prev_iter_num); >> + >> + xc_report_progress_start(xch, reportbuf, mem_size); >> + >> + if ( (iter > 1 && >> + dirty_pages_on_prev_iter_num < DEF_MIN_DIRTY_PER_ITER) || >> + (iter == max_iters) || >> + (total_dirty_pages_num >= mem_size*max_factor) ) >> + { >> + if ( debug ) >> + IPRINTF("Last iteration"); >> + last_iter = 1; >> + } >> + >> + if ( last_iter ) >> + { >> + if ( suspend_and_state(callbacks->suspend, callbacks->data, >> + xch, dom) ) >> + { >> + ERROR("Domain appears not to have suspended"); >> + return -1; >> + } >> + } >> + if ( live && iter > 1 ) >> + { >> + frc = xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN, >> + HYPERCALL_BUFFER(to_send), mem_size, >> + NULL, 0, NULL); >> + if ( frc != mem_size ) >> + { >> + ERROR("Error peeking shadow bitmap"); >> + xc_hypercall_buffer_free_pages(xch, to_send, >> + NRPAGES(bitmap_size(mem_size))); >> + return -1; >> + } >> + } >> + >> + busy_pages = malloc(sizeof(xen_pfn_t) * busy_pages_max); >> + >> + for ( i = start; i < end; ++i ) >> + { >> + if ( test_bit(i - start, to_send) ) >> + { >> + page = xc_map_foreign_range(xch, dom, PAGE_SIZE, PROT_READ, i); > On x86 we try to do this in batches to reduce the overheads. I suppose > that could be a future enhancement.OK. I''ll make TODO comment.> >> + if ( !page ) >> + { >> + /* This page is mapped elsewhere, should be resent later */ > What does this ("busy") mean? When does this happen?Oh, it looks like workaround of problem that maybe don''t need now. In case of dom0 with 2 CPU xc_map_foriegn range can return NULL, but guest page is exist. The later call xc_map_foriegn_range on the same page will return valid pointer. I''ll remove this.> > [...] >> + >> +static int restore_guest_params(xc_interface *xch, int io_fd, >> + uint32_t dom, guest_params_t *params) >> +{ >> [...] >> + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) >> + { >> + ERROR("Can''t set memory map"); >> + return -1; >> + } >> + >> + /* Set max. number of vcpus as max_vcpu_id + 1 */ >> + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) ) > Does the higher level toolstack not take care of vcpus and maxmem? I > thought so. I think this is how it shoud be.For my tests guest config information is not transferred for ARM case from high-level stack. At the migration receiver side toolstack always create a new domain with vcpus=1 and default max. 
mem. So we have to send guest information as our local guest_params structure (at the beginning of migration). It is easy way to work "xl save" or "xl migrate" without modification of libxl level, but you may have another idea? Also, toolstack_restore callback is not set (NULL) for ARM case.> >> + { >> + ERROR("Can''t set max vcpu number for domain"); >> + return -1; >> + } >> + >> + return 0; >> +} >> +[...] >> diff --git a/tools/misc/Makefile b/tools/misc/Makefile >> index 17aeda5..0824100 100644 >> --- a/tools/misc/Makefile >> +++ b/tools/misc/Makefile >> @@ -11,7 +11,7 @@ HDRS = $(wildcard *.h) >> >> TARGETS-y := xenperf xenpm xen-tmem-list-parse gtraceview gtracestat xenlockprof xenwatchdogd xencov >> TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump >> -TARGETS-$(CONFIG_MIGRATE) += xen-hptool >> +TARGETS-$(CONFIG_X86) += xen-hptool >> TARGETS := $(TARGETS-y) >> >> SUBDIRS := $(SUBDIRS-y) >> @@ -23,7 +23,7 @@ INSTALL_BIN := $(INSTALL_BIN-y) >> INSTALL_SBIN-y := xen-bugtool xen-python-path xenperf xenpm xen-tmem-list-parse gtraceview \ >> gtracestat xenlockprof xenwatchdogd xen-ringwatch xencov >> INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump >> -INSTALL_SBIN-$(CONFIG_MIGRATE) += xen-hptool >> +INSTALL_SBIN-$(CONFIG_X86) += xen-hptool >> INSTALL_SBIN := $(INSTALL_SBIN-y) >> >> INSTALL_PRIVBIN-y := xenpvnetboot > You could resend these last two separately and they could probably go > straight in.OK> > Ian. > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel >
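For readers following along, the guest_params_t block sent at the head of the stream presumably carries roughly the fields below. This is inferred from the code quoted above (start_gpfn, max_gpfn, flags, max_vcpu_id); the real layout lives in xc_arm_migrate.c and may differ.

    /* Hypothetical layout of the parameter block sent before the memory. */
    typedef struct guest_params {
        xen_pfn_t start_gpfn;   /* GUEST_RAM_BASE >> PAGE_SHIFT */
        xen_pfn_t max_gpfn;     /* one past the last guest RAM pfn */
        uint32_t  max_vcpu_id;  /* receiver calls xc_domain_max_vcpus(id + 1) */
        uint32_t  flags;        /* XCFLAGS_LIVE, XCFLAGS_DEBUG, ... */
    } guest_params_t;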
Ian Campbell
2013-Nov-13 10:56 UTC
Re: [PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
On Wed, 2013-11-13 at 12:00 +0400, Eugene Fedotov wrote:> > [...] > >> +static int vgic_irq_rank_save(struct vgic_rank *ext, > >> + struct vgic_irq_rank *rank) > >> +{ > >> + spin_lock(&rank->lock); > > This is probably wise, but the domain must be paused at this point, > > right? > I think this lock can be removed, thanks.I''d be happy for it to stay as a belt-and-braces/good practice thing.> >> + return -EINVAL; > >> + } > >> + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); > >> + /* ITARGETS */ > >> + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) > >> + { > >> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); > >> + return -EINVAL; > >> + } > >> + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); > >> + spin_unlock(&rank->lock); > >> + return 0; > >> +} > >> + > >> +static int gic_save(struct domain *d, hvm_domain_context_t *h) > >> +{ > >> + struct hvm_hw_gic ctxt; > >> + struct vcpu *v; > >> + > >> + /* Save the state of GICs */ > >> + for_each_vcpu( d, v ) > > Where is the GICD state saved then? > The only GICD structure we save for guest domain is struct > vgic_irq_rank, it includes: IENABLE, IACTIVE, IPEND, PENDSGI, ICFG, > IPRIORITY, ITARGETS registers. We create the same structure inside hvm : > vgic_rank (that is no guaranteed to be the same as struct vgic_irq_rank) > and save it calling vgic_irq_rank_save routine below in gic_save.I can only see one call to vgic_irq_rank_save which is the one to save PPI state within the vcpu loop. What about the (per-cpu) SGI and (global) SPIs? I can''t see where either of those are saved.> >> + ctxt.x0 = c.x0; > > Would it make sense to include a field of type vcpu_guest_core_regs in > > the save struct instead of this big list? > Yes, I think vcpu_guest_core_regs is already common definition for ARM32 > and ARM64.Correct.> >> diff --git a/xen/include/asm-arm/hvm/support.h b/xen/include/asm-arm/hvm/support.h > >> new file mode 100644 > >> index 0000000..8311f2f > >> --- /dev/null > >> +++ b/xen/include/asm-arm/hvm/support.h > >> @@ -0,0 +1,29 @@ > >> +/* > >> + * support.h: HVM support routines used by ARMv7 VE. > > Not just on Vexpress nor just on v7 I expect? > VE means Virtualization Extentions here. > OK, we should support ARM v8 in future,so can be "HVM support routines > used by ARMv7/v8 with Virtualization Extensions" a good comment?I think just "...used by ARM" would be sufficient.> >> + * > >> + * Copyright (c) 2012, Citrix Systems > > Really? > OK > >> + * > >> + * This program is free software; you can redistribute it and/or modify it > >> + * under the terms and conditions of the GNU General Public License, > >> + * version 2, as published by the Free Software Foundation. > >> + * > >> + * This program is distributed in the hope it will be useful, but WITHOUT > >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > >> + * more details. > >> + * > >> + * You should have received a copy of the GNU General Public License along with > >> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple > >> + * Place - Suite 330, Boston, MA 02111-1307 USA. > >> + */ > >> + > >> +#ifndef __ASM_ARM_HVM_SUPPORT_H__ > >> +#define __ASM_ARM_HVM_SUPPORT_H__ > >> + > >> +#include <xen/types.h> > >> +#include <public/hvm/ioreq.h> > >> +#include <xen/sched.h> > >> +#include <xen/hvm/save.h> > >> +#include <asm/processor.h> > > Just some #includes here? 
> Yes, I remove unnecessary includes.
Oh, I see now this header is required by common code, that's OK then. But
please do minimise what is needed here.

Ian.
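A sketch of what the reworked save record could look like with the vcpu_guest_core_regs field discussed earlier in this message; the field selection and the VFP sizing are illustrative only, not the final public layout.

    struct hvm_hw_cpu {
        /* Covers x0..x30/sp/pc/cpsr for AArch64 and, via the shared union
         * in the public headers, the AArch32 view (r*_usr, banked sp/lr,
         * spsr_*), replacing the long hand-written register list. */
        struct vcpu_guest_core_regs core_regs;

        /* Control registers, as in the quoted patch. */
        uint64_t actlr;
        uint64_t sctlr;
        uint64_t cpacr;

        /* VFP/SIMD state; the save path should BUILD_BUG_ON() against
         * sizeof(v->arch.vfp) rather than checking at run time. */
        uint64_t vfp[64];
    };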
Ian Campbell
2013-Nov-13 10:58 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
On Wed, 2013-11-13 at 12:28 +0400, Eugene Fedotov wrote:
> >> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> >> index 123280e..3801f07 100644
> >> --- a/xen/arch/arm/mm.c
> >> +++ b/xen/arch/arm/mm.c
> >> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
> >>
> >>   unsigned long domain_get_maximum_gpfn(struct domain *d)
> > s/unsigned long/xen_pfn_t/ I think.
> >
> > Urk, which will break the ABI for this hypercall. That's something of a
> > conundrum :-/
> >
> > It is a domctl so we are at liberty to change it, probably to taking a
> > xen_pfn_t * to fill in instead of returning the value.
> >
> > I'm in two minds about whether we should do so now or postpone it (which
> > risks forgetting and having an obscure bug somewhere down the line).
> OK. Either that, or another solution is to drop this hypercall from the
> migration code entirely and use the max_memkb field from
> DOMCTL_getdomaininfo to obtain the maximum PFN.

Is there an interface to retrieve that? I don't see it :-(

Ian.
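On the tools side, libxc does expose a max_memkb field through xc_domain_getinfo() (filled from XEN_DOMCTL_getdomaininfo); whether that is an acceptable substitute for the hypercall is exactly the open question here, but the lookup itself would be along these lines, with the contiguous-RAM assumption made explicit.

    static int get_max_gpfn_from_dominfo(xc_interface *xch, uint32_t domid,
                                         xen_pfn_t *max_gpfn)
    {
        xc_dominfo_t info;

        if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
             info.domid != domid )
            return -1;

        /* Assumes guest RAM is contiguous from GUEST_RAM_BASE upwards. */
        *max_gpfn = (GUEST_RAM_BASE >> XC_PAGE_SHIFT) +
                    (info.max_memkb >> (XC_PAGE_SHIFT - 10));
        return 0;
    }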
Ian Campbell
2013-Nov-13 11:09 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
On Wed, 2013-11-13 at 13:57 +0400, Eugene Fedotov wrote:> >> diff --git a/tools/libxc/xc_arm_migrate.c b/tools/libxc/xc_arm_migrate.c > >> new file mode 100644 > >> index 0000000..461e339 > >> --- /dev/null > >> +++ b/tools/libxc/xc_arm_migrate.c > >> @@ -0,0 +1,712 @@ > > Is this implementing the exact protocol as described in > > tools/libxc/xg_save_restore.h or is it a variant? Are there any docs of > > the specifics of the ARM protocol?> This implements a quite different from tools/libxc/xg_save_restore.h > protocol, it is much more simplified because we do not need some things > that implemented for x86. So you''re right, it has to be documented. > Should we use a different header to place documentation to this (and > place some definitions), for example xc_arm_migrate.h?I think xg_arm_save_restore.h would be the consistent name. At some point we should rename some of the x86 specific stuff, but you don''t need to worry about that.> > We will eventually need to make a statement about the stability of the > > protocol, i.e on x86 we support X->X+1 migrations across Xen versions. I > > think we''d need to make similar guarantees on ARM before we would remove > > the "tech preview" label from the migration feature. > So, should you believe our results (and where should we place this > statement) or should you make tests from your side?By stability I mean the stability of the migration ABI across version changes, not stability as in the bugginess or otherwise of the code. IOW a commitment to supporting migration from version X to version X+1 in the future. WRT "tech preview" I don''t think we are going to be able to honestly call this production ready for the general case in 4.4, I expect it''ll need a bit longer to "bed in" before we are happy with that statement. (this has no bearing on whether you want to fully support it as production ready in your particular application).> > > >> +/* ============ Memory ============= */ > >> +static int save_memory(xc_interface *xch, int io_fd, uint32_t dom, > >> + struct save_callbacks *callbacks, > >> + uint32_t max_iters, uint32_t max_factor, > >> + guest_params_t *params) > >> +{ > >> + int live = !!(params->flags & XCFLAGS_LIVE); > >> + int debug = !!(params->flags & XCFLAGS_DEBUG); > >> + xen_pfn_t i; > >> + char reportbuf[80]; > >> + int iter = 0; > >> + int last_iter = !live; > >> + int total_dirty_pages_num = 0; > >> + int dirty_pages_on_prev_iter_num = 0; > >> + int count = 0; > >> + char *page = 0; > >> + xen_pfn_t *busy_pages = 0; > >> + int busy_pages_count = 0; > >> + int busy_pages_max = 256; > >> + > >> + DECLARE_HYPERCALL_BUFFER(unsigned long, to_send); > >> + > >> + xen_pfn_t start = params->start_gpfn; > >> + const xen_pfn_t end = params->max_gpfn; > >> + const xen_pfn_t mem_size = end - start; > >> + > >> + if ( debug ) > >> + { > >> + IPRINTF("(save mem) start=%llx end=%llx!\n", start, end); > >> + } > > FYI you don''t need the {}''s for cases like this. > Actually we don''t need {}, this has been done because we was not sure if > this macro can be empty-substituted. > > > > is if ( debug ) IPRINTF(...) not the equivalent of DPRINTF? > This equivalence is not obvious for me, because in current code we > obtain debug flag with XCFLAGS_DEBUG mask (when --debug option passed). > If it is equivalent I''ll use DPRINTF.No you are right, these are different. Actually DPRINTF is "detail level", there is a DBGPRINTF which is "debug level" (levels here being in terms of XTL_FOO from xentoollog.h). 
So you probably do want to use if (debug), although you may separately want to consider which level the resulting messages are printed at when they are printed...> > > > [...] > >> + > >> +static int restore_guest_params(xc_interface *xch, int io_fd, > >> + uint32_t dom, guest_params_t *params) > >> +{ > >> [...] > >> + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) > >> + { > >> + ERROR("Can''t set memory map"); > >> + return -1; > >> + } > >> + > >> + /* Set max. number of vcpus as max_vcpu_id + 1 */ > >> + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) ) > > Does the higher level toolstack not take care of vcpus and maxmem? I > > thought so. I think this is how it shoud be. > > For my tests guest config information is not transferred for ARM case > from high-level stack. At the migration receiver side toolstack always > create a new domain with vcpus=1 and default max. mem. So we have to > send guest information as our local guest_params structure (at the > beginning of migration). > It is easy way to work "xl save" or "xl migrate" without modification of > libxl level, but you may have another idea? > Also, toolstack_restore callback is not set (NULL) for ARM case.So you aren''t using xl to do the migration? This is what we should ultimately be aiming for. It is almost certainly going to require fixes at the libxl level though. If you need to send additional information for your usecase then it should be done at the layer above libxc. Ian.
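Purely as an illustration of the explicit chunk versioning requested earlier in the thread (this is not what the posted patch implements), a hypothetical xg_arm_save_restore.h could make every chunk self-describing, so new chunk types can be added later, e.g. for Remus, and unknown ones skipped by older receivers.

    /* Hypothetical stream framing, for discussion only. */
    #define XC_ARM_SAVE_CHUNK_END           0
    #define XC_ARM_SAVE_CHUNK_GUEST_PARAMS  1
    #define XC_ARM_SAVE_CHUNK_PAGE_DATA     2
    #define XC_ARM_SAVE_CHUNK_HVM_CONTEXT   3

    struct xc_arm_save_chunk_hdr {
        uint32_t type;      /* one of the XC_ARM_SAVE_CHUNK_* values    */
        uint32_t version;   /* bumped whenever the chunk layout changes */
        uint64_t length;    /* bytes of payload following this header   */
    };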
Ian Campbell
2013-Nov-13 12:21 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
Readding the list... On Wed, 2013-11-13 at 12:19 +0000, Ian Campbell wrote:> On Wed, 2013-11-13 at 16:17 +0400, Eugene Fedotov wrote: > > 13.11.2013 15:09, Ian Campbell wrote: > > > For my tests guest config information is not transferred for ARM case > > > from high-level stack. At the migration receiver side toolstack always > > > create a new domain with vcpus=1 and default max. mem. So we have to > > > send guest information as our local guest_params structure (at the > > > beginning of migration). > > > It is easy way to work "xl save" or "xl migrate" without modification of > > > libxl level, but you may have another idea? > > > Also, toolstack_restore callback is not set (NULL) for ARM case. > > > So you aren''t using xl to do the migration? This is what we should > > > ultimately be aiming for. It is almost certainly going to require fixes > > > at the libxl level though. > > Some misunderstanding. We are using xl for migration. I mean that libxl > > doesn''t correctly transfer guest parameters: number of vcpus, memory. > > Huh, I''d have thought that would all be taken care of by generic libxl > code and not require x86 and arm specifics. Maybe I''m wrong. > > In any case it should be fixed. > > > At the proof-of-concept stage there was easier to transfer it inside > > xc_domain_save and xc_domain_restore rather then patching libxl. > > Understood. > > > For example, we should correctly set maximum memory for destination > > domain by using xc_setmaxmem hypercall. Otherwise, toolstack set it by > > default calling > > xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + > > LIBXL_MAXMEM_CONSTANT); > > (see libxl_dom.c:241). But we don''t need to add LIBXL_MAXMEM_CONSTANT on > > ARM, so we set it manually. > > This is added on create, so why not on restore? Whether or not it is > necessary on ARM is a separate question, it should be consistent. > > Ian.
Ian Campbell
2013-Nov-13 12:22 UTC
Re: [PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
To the list this time as well. On Wed, 2013-11-13 at 11:47 +0000, Ian Campbell wrote:> On Wed, 2013-11-13 at 15:50 +0400, Eugene Fedotov wrote: > > 13.11.2013 14:56, Ian Campbell пишет: > > > On Wed, 2013-11-13 at 12:00 +0400, Eugene Fedotov wrote: > > > > > >>> [...] > > >>>> +static int vgic_irq_rank_save(struct vgic_rank *ext, > > >>>> + struct vgic_irq_rank *rank) > > >>>> +{ > > >>>> + spin_lock(&rank->lock); > > >>> This is probably wise, but the domain must be paused at this point, > > >>> right? > > >> I think this lock can be removed, thanks. > > > I'd be happy for it to stay as a belt-and-braces/good practice thing. > > > > > >>>> + return -EINVAL; > > >>>> + } > > >>>> + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); > > >>>> + /* ITARGETS */ > > >>>> + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) > > >>>> + { > > >>>> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); > > >>>> + return -EINVAL; > > >>>> + } > > >>>> + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); > > >>>> + spin_unlock(&rank->lock); > > >>>> + return 0; > > >>>> +} > > >>>> + > > >>>> +static int gic_save(struct domain *d, hvm_domain_context_t *h) > > >>>> +{ > > >>>> + struct hvm_hw_gic ctxt; > > >>>> + struct vcpu *v; > > >>>> + > > >>>> + /* Save the state of GICs */ > > >>>> + for_each_vcpu( d, v ) > > >>> Where is the GICD state saved then? > > >> The only GICD structure we save for guest domain is struct > > >> vgic_irq_rank, it includes: IENABLE, IACTIVE, IPEND, PENDSGI, ICFG, > > >> IPRIORITY, ITARGETS registers. We create the same structure inside hvm : > > >> vgic_rank (that is no guaranteed to be the same as struct vgic_irq_rank) > > >> and save it calling vgic_irq_rank_save routine below in gic_save. > > > I can only see one call to vgic_irq_rank_save which is the one to save > > > PPI state within the vcpu loop. What about the (per-cpu) SGI and > > > (global) SPIs? I can't see where either of those are saved. > > > 1) For the guest domain vgic.nr_lines was set to 0 (arch_domain > > structure, see "domain_vgic_init" function in vgic.c): > > d->arch.vgic.nr_lines = 0; /* We don't need SPIs for the guest */ > > Should we not rely on this, assuming that SPIs will be enabled for guest? > > Aha! I'd forgotten about that. I think we can safely ignore SPIs until > device passthrough is added. And even then I don't think we would want > to migrate with an active passthrough device attached. > > > 2) Could you point me where is SGI states are situated in arch_vcpu? I > > thought they are inside those 32 IRQs (namely 0-15 IRQs) that we have > > already saved, because vgic_irq_rank in arch_vcpu is the context for > > those 32 IRQs. > > SGIs are "interrupts" 0..15 which PPIs are "interrupts" 16..31, so if > you are saving the first rank which contains all 32 of those then I > think you have probably covered them already. I think you just need to > update the comment(s) to mention SGI as well as PPI. > > Thanks, > > Ian >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Eugene Fedotov
2013-Nov-13 12:31 UTC
Re: [PATCH RESEND v5 1/6] xen/arm: Implement hvm save and restore
13.11.2013 15:47, Ian Campbell пишет:> On Wed, 2013-11-13 at 15:50 +0400, Eugene Fedotov wrote: >> 13.11.2013 14:56, Ian Campbell пишет: >>> On Wed, 2013-11-13 at 12:00 +0400, Eugene Fedotov wrote: >>> >>>>> [...] >>>>>> +static int vgic_irq_rank_save(struct vgic_rank *ext, >>>>>> + struct vgic_irq_rank *rank) >>>>>> +{ >>>>>> + spin_lock(&rank->lock); >>>>> This is probably wise, but the domain must be paused at this point, >>>>> right? >>>> I think this lock can be removed, thanks. >>> I'd be happy for it to stay as a belt-and-braces/good practice thing. >>> >>>>>> + return -EINVAL; >>>>>> + } >>>>>> + memcpy(ext->ipriority, rank->ipriority, sizeof(rank->ipriority)); >>>>>> + /* ITARGETS */ >>>>>> + if ( sizeof(rank->itargets) != sizeof (ext->itargets) ) >>>>>> + { >>>>>> + dprintk(XENLOG_G_ERR, "hvm_hw_gic: check itargets dumping space\n"); >>>>>> + return -EINVAL; >>>>>> + } >>>>>> + memcpy(ext->itargets, rank->itargets, sizeof(rank->itargets)); >>>>>> + spin_unlock(&rank->lock); >>>>>> + return 0; >>>>>> +} >>>>>> + >>>>>> +static int gic_save(struct domain *d, hvm_domain_context_t *h) >>>>>> +{ >>>>>> + struct hvm_hw_gic ctxt; >>>>>> + struct vcpu *v; >>>>>> + >>>>>> + /* Save the state of GICs */ >>>>>> + for_each_vcpu( d, v ) >>>>> Where is the GICD state saved then? >>>> The only GICD structure we save for guest domain is struct >>>> vgic_irq_rank, it includes: IENABLE, IACTIVE, IPEND, PENDSGI, ICFG, >>>> IPRIORITY, ITARGETS registers. We create the same structure inside hvm : >>>> vgic_rank (that is no guaranteed to be the same as struct vgic_irq_rank) >>>> and save it calling vgic_irq_rank_save routine below in gic_save. >>> I can only see one call to vgic_irq_rank_save which is the one to save >>> PPI state within the vcpu loop. What about the (per-cpu) SGI and >>> (global) SPIs? I can't see where either of those are saved. >> 1) For the guest domain vgic.nr_lines was set to 0 (arch_domain >> structure, see "domain_vgic_init" function in vgic.c): >> d->arch.vgic.nr_lines = 0; /* We don't need SPIs for the guest */ >> Should we not rely on this, assuming that SPIs will be enabled for guest? > Aha! I'd forgotten about that. I think we can safely ignore SPIs until > device passthrough is added. And even then I don't think we would want > to migrate with an active passthrough device attached. > >> 2) Could you point me where is SGI states are situated in arch_vcpu? I >> thought they are inside those 32 IRQs (namely 0-15 IRQs) that we have >> already saved, because vgic_irq_rank in arch_vcpu is the context for >> those 32 IRQs. > SGIs are "interrupts" 0..15 which PPIs are "interrupts" 16..31, so if > you are saving the first rank which contains all 32 of those then I > think you have probably covered them already. I think you just need to > update the comment(s) to mention SGI as well as PPI.OK, thanks. Best regards, Evgeny. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Jaeyong Yoo
2013-Nov-14 23:58 UTC
Re: [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
> -----Original Message----- > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > Sent: Wednesday, November 13, 2013 12:59 AM > To: Jaeyong Yoo > Cc: xen-devel@lists.xen.org > Subject: Re: [Xen-devel] [PATCH RESEND v5 4/6] xen/arm: Implement virtual- > linear page table for guest p2m mapping in live migration > > On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: > > Implement vlpt (virtual-linear page table) for fast accessing of 3rd PTE > of guest p2m. > > When creating a mapping for vlpt, just copy the 1st level PTE of guest > > p2m to the xen''s 2nd level PTE. Then the mapping becomes the following: > > xen''s 1st PTE --> > > xen''s 2nd PTE (which is the same as 1st PTE of guest p2m) --> > > guest p2m''s 2nd PTE --> > > guest p2m''s 3rd PTE (the memory contents where the vlpt points) > > For more info about vlpt, see: > > http://www.technovelty.org/linux/virtual-linear-page-table.html > > > > This function is used in dirty-page tracing: when domU write-fault is > > trapped by xen, xen can immediately locate the 3rd PTE of guest p2m. > > The following link shows the performance comparison for handling a > > dirty-page between vlpt and typical page table walking. > > http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html > > > > Changes from v4: > > 1. In the restoring vlpt, use __foo variant without barriers for > write_pte and > > flush_xen_data_tlb_range_va. > > 2. Support two consecutive pages for guest''s first level page table. > > > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > > --- > > xen/arch/arm/domain.c | 5 ++ > > xen/arch/arm/mm.c | 112 > +++++++++++++++++++++++++++++++++++++++ > > xen/include/asm-arm/arm32/page.h | 41 ++++++++++---- > > xen/include/asm-arm/config.h | 5 ++ > > xen/include/asm-arm/domain.h | 7 +++ > > xen/include/asm-arm/mm.h | 15 ++++++ > > 6 files changed, 174 insertions(+), 11 deletions(-) > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > cb0424d..c0b5dd8 100644 > > --- a/xen/arch/arm/domain.c > > +++ b/xen/arch/arm/domain.c > > @@ -509,6 +509,11 @@ int arch_domain_create(struct domain *d, unsigned > int domcr_flags) > > /* Default the virtual ID to match the physical */ > > d->arch.vpidr = boot_cpu_data.midr.bits; > > > > + d->arch.dirty.second_lvl_start = 0; > > + d->arch.dirty.second_lvl_end = 0; > > + d->arch.dirty.second_lvl[0] = NULL; > > + d->arch.dirty.second_lvl[1] = NULL; > > + > > clear_page(d->shared_info); > > share_xen_page_with_guest( > > virt_to_page(d->shared_info), d, XENSHARE_writable); diff > > --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index 3801f07..bf13993 > > 100644 > > --- a/xen/arch/arm/mm.c > > +++ b/xen/arch/arm/mm.c > > @@ -1328,6 +1328,118 @@ void get_gma_start_end(struct domain *d, paddr_t > *start, paddr_t *end) > > *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << > > PAGE_SHIFT); } > > > > +/* flush the vlpt area */ > > +void flush_vlpt(struct domain *d) > > +{ > > + int flush_size; > > + flush_size = (d->arch.dirty.second_lvl_end - > > + d->arch.dirty.second_lvl_start) << SECOND_SHIFT; > > + /* flushing the 3rd level mapping */ > > + flush_xen_data_tlb_range_va(d->arch.dirty.second_lvl_start << > SECOND_SHIFT, > > + flush_size); > > All these << SECOND_SHIFT, can they be put into a macro > VLPT_OFFSET(guest_vaddr) perhaps (assuming that''s the right semantics) >VLPT_OFFSET sounds fine.> > +} > > + > > +/* restore the xen page table for vlpt mapping for domain d */ void > > +restore_vlpt(struct domain *d) > > I guess this is called on context switch? 
Why not in this patch?OK.> > > +{ > > + int i; > > + dsb(); > > + for ( i = d->arch.dirty.second_lvl_start; > > + i < d->arch.dirty.second_lvl_end; > > + ++i ) > > + { > > + int k = i % LPAE_ENTRIES; > > + int l = i / LPAE_ENTRIES; > > + > > + if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits ) > > + { > > + __write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]); > > + __flush_xen_data_tlb_range_va(i << SECOND_SHIFT, > > + 1 << SECOND_SHIFT); > > + } > > + } > > + dsb(); > > + isb(); > > +} > > + > > +/* setting up the xen page table for vlpt mapping for domain d */ int > > +prepare_vlpt(struct domain *d) { > > + int xen_second_linear_base; > > + int gp2m_start_index, gp2m_end_index; > > + struct p2m_domain *p2m = &d->arch.p2m; > > + struct page_info *second_lvl_page; > > + paddr_t gma_start = 0; > > + paddr_t gma_end = 0; > > + lpae_t *first[2]; > > + int i; > > + uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START; > > + > > + get_gma_start_end(d, &gma_start, &gma_end); > > + required = (gma_end - gma_start) >> LPAE_SHIFT; > > + > > + > > + if ( required > avail ) > > avail is the number of bytes of virtual address space in the linear p2m > area. > > required is the number of pages which the guest has. Does this comparison > correctly account for each page being represented by an 8-byte lpae_t > entry?Yes, it does. See, LPAE_SHIFT only shifts 9 bits. Shifting PAGE_SHIFT (12 bits) gives the number of pages (every bit), and 9-bit shift gives the required memory for storing third-level PTE. Since this one-shifting of LPAE_SHIFT is confusing enough, we can make it explicit by using something like required = ((gma_end - gma_start) >> PAGE_SHIFT ) * (sizeof lpae_t).> > > + { > > + dprintk(XENLOG_ERR, "Available VLPT is small for domU guest" > > + "(avail: %llx, required: %llx)\n", > > + avail, required); > > + return -ENOMEM; > > + } > > + > > + xen_second_linear_base > > + second_linear_offset(VIRT_LIN_P2M_START); > > + > > + gp2m_start_index = gma_start >> FIRST_SHIFT; > > + gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1; > > + > > + if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 ) > > + { > > + dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU"); > > + return -ENOMEM; > > + } > > + > > + second_lvl_page = alloc_domheap_pages(NULL, 1, 0); > > There''d be no harm in allocating the two pages separately I think, and > avoiding the possiblilty of an order 1 allocation failing. > > But if you do allocate them contiguously then d->arch.dirty.second_lvl can > be a simple lpae_t * and you can just access offsets 0..1023 without > worrying about managing the array of two pointers. > > As it stands I think you have the worst of both worlds. > > We could also just consider allocating all firstlevel p2m pages from the > xenheap instead of the domheap.OK. xenheap allocation sounds good. I was not sure if I can use xenheap for the one that looks like ''domain-specific'' purpose. 
If it''s OK, it would be much better code readability.> > > + if ( second_lvl_page == NULL ) > > + return -ENOMEM; > > + > > + /* First level p2m is 2 consecutive pages */ > > + d->arch.dirty.second_lvl[0] = map_domain_page_global( > > + page_to_mfn(second_lvl_page) ); > > + d->arch.dirty.second_lvl[1] = map_domain_page_global( > > + > > + page_to_mfn(second_lvl_page+1) ); > > + > > + first[0] = __map_domain_page(p2m->first_level); > > + first[1] = __map_domain_page(p2m->first_level+1); > > + for ( i = gp2m_start_index; i < gp2m_end_index; ++i ) > > Can''t this just be a loop over 0..1023 and avoid all this modular > arithmetic:Sure.> > + { > > + int k = i % LPAE_ENTRIES; > > + int l = i / LPAE_ENTRIES; > > + int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES; > > + int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES; > > + > > + write_pte(&xen_second[xen_second_linear_base+i], > > + first[l][k]); > > + > > + /* we copy the mapping into domain''s structure as a reference > > + * in case of the context switch (used in restore_vlpt) */ > > Do you keep this up to date when changing the p2m? > > In fact, why aren''t you just mapping the actual first level pages? Why > does it have to be a copy?Oh right. That is much better! Then, no need to maintain the copy. Thanks!> > > + d->arch.dirty.second_lvl[l2][k2] = first[l][k]; > > + } > > + unmap_domain_page(first[0]); > > + unmap_domain_page(first[1]); > > + > > + /* storing the start and end index */ > > + d->arch.dirty.second_lvl_start = xen_second_linear_base + > gp2m_start_index; > > + d->arch.dirty.second_lvl_end = xen_second_linear_base + > > + gp2m_end_index; > > + > > + flush_vlpt(d); > > + return 0; > > +} > > + > > +void cleanup_vlpt(struct domain *d) > > +{ > > + /* First level p2m is 2 consecutive pages */ > > + unmap_domain_page_global(d->arch.dirty.second_lvl[0]); > > + unmap_domain_page_global(d->arch.dirty.second_lvl[1]); > > +} > > /* > > * Local variables: > > * mode: C > > diff --git a/xen/include/asm-arm/arm32/page.h > > b/xen/include/asm-arm/arm32/page.h > > index cf12a89..0a4e115 100644 > > --- a/xen/include/asm-arm/arm32/page.h > > +++ b/xen/include/asm-arm/arm32/page.h > > @@ -5,20 +5,28 @@ > > > > /* Write a pagetable entry. > > * > > - * If the table entry is changing a text mapping, it is > > responsibility > > - * of the caller to issue an ISB after write_pte. > > + * All necessary barriers are responsibility of the caller > > */ > > -static inline void write_pte(lpae_t *p, lpae_t pte) > > +static inline void __write_pte(lpae_t *p, lpae_t pte) > > { > > asm volatile ( > > - /* Ensure any writes have completed with the old mappings. */ > > - "dsb;" > > /* Safely write the entry (STRD is atomic on CPUs that support > LPAE) */ > > "strd %0, %H0, [%1];" > > - "dsb;" > > : : "r" (pte.bits), "r" (p) : "memory"); } > > > > +/* Write a pagetable entry with dsb barriers. > > + * > > + * If the table entry is changing a text mapping, it is > > +responsibility > > + * of the caller to issue an ISB after write_pte. > > + */ > > +static inline void write_pte(lpae_t *p, lpae_t pte) { > > + dsb(); > > Please retain the comments from the original version.OK.> > > + __write_pte(p, pte); > > + dsb(); > > +} > > + > > /* Inline ASM to flush dcache on register R (may be an inline asm > > operand) */ #define __flush_xen_dcache_one(R) STORE_CP32(R, DCCMVAC) > > > > @@ -57,18 +65,28 @@ static inline void flush_xen_data_tlb(void) } > > > > /* > > - * Flush a range of VA''s hypervisor mappings from the data TLB. 
This > > is not > > - * sufficient when changing code mappings or for self modifying code. > > + * Flush a range of VA''s hypervisor mappings from the data TLB. > > + * All necessary barriers are responsibility of the caller > > */ > > -static inline void flush_xen_data_tlb_range_va(unsigned long va, > > unsigned long size) > > +static inline void __flush_xen_data_tlb_range_va(unsigned long va, > > +unsigned long size) > > { > > unsigned long end = va + size; > > - dsb(); /* Ensure preceding are visible */ > > while ( va < end ) { > > asm volatile(STORE_CP32(0, TLBIMVAH) > > : : "r" (va) : "memory"); > > va += PAGE_SIZE; > > } > > +} > > + > > +/* > > + * Flush a range of VA''s hypervisor mappings from the data TLB with > barriers. > > + * This barrier is not sufficient when changing code mappings or for > > +self > > + * modifying code. > > + */ > > +static inline void flush_xen_data_tlb_range_va(unsigned long va, > > +unsigned long size) { > > + dsb(); /* Ensure preceding are visible */ > > + __flush_xen_data_tlb_range_va(va, size); > > dsb(); /* Ensure completion of the TLB flush */ > > isb(); > > } > > diff --git a/xen/include/asm-arm/config.h > > b/xen/include/asm-arm/config.h index 5b7b1a8..15ad56d 100644 > > --- a/xen/include/asm-arm/config.h > > +++ b/xen/include/asm-arm/config.h > > @@ -87,6 +87,7 @@ > > * 0 - 8M <COMMON> > > * > > * 32M - 128M Frametable: 24 bytes per page for 16GB of RAM > > + * 128M - 256M Virtual-linear mapping to P2M table > > * 256M - 1G VMAP: ioremap and early_ioremap use this virtual address > > * space > > * > > @@ -124,7 +125,9 @@ > > #define CONFIG_SEPARATE_XENHEAP 1 > > > > #define FRAMETABLE_VIRT_START _AT(vaddr_t,0x02000000) > > +#define VIRT_LIN_P2M_START _AT(vaddr_t,0x08000000) > > #define VMAP_VIRT_START _AT(vaddr_t,0x10000000) > > +#define VIRT_LIN_P2M_END VMAP_VIRT_START > > #define XENHEAP_VIRT_START _AT(vaddr_t,0x40000000) > > #define XENHEAP_VIRT_END _AT(vaddr_t,0x7fffffff) > > #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) > > @@ -157,6 +160,8 @@ > > > > #define HYPERVISOR_VIRT_END DIRECTMAP_VIRT_END > > > > +/*TODO (ARM_64): define VIRT_LIN_P2M_START VIRT_LIN_P2M_END */ > > + > > #endif > > > > /* Fixmap slots */ > > diff --git a/xen/include/asm-arm/domain.h > > b/xen/include/asm-arm/domain.h index 67bfbbc..4f366f1 100644 > > --- a/xen/include/asm-arm/domain.h > > +++ b/xen/include/asm-arm/domain.h > > @@ -112,6 +112,13 @@ struct arch_domain > > spinlock_t lock; > > } vuart; > > > > + /* dirty-page tracing */ > > + struct { > > + volatile int second_lvl_start; /* for context switch */ > > + volatile int second_lvl_end; > > Are these virtual addresses, or pages or something else? We mostly hav > specific types (e.g. vaddr_t) for such things. >OK> volatile is almost always wrong to use. Why do you want it here? If there > is an actual coherency issue you will need proper barriers, not volatile.Ah, OK. Barriers would be more explicit.> > Ian.
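For context, the payoff of the table set up in prepare_vlpt() is that the permission-fault handler can reach a guest 3rd-level p2m entry with plain pointer arithmetic instead of a p2m walk. A minimal sketch of what get_vlpt_3lvl_pte() amounts to, assuming one lpae_t per guest page laid out linearly from VIRT_LIN_P2M_START (the actual patch may compute the offset slightly differently):

    static inline lpae_t *get_vlpt_3lvl_pte(paddr_t addr)
    {
        /* The window exposes the guest's 3rd-level p2m tables as a flat
         * array indexed by gpfn; addresses below GUEST_RAM_BASE simply
         * fall into an unused (unmapped) part of the window. */
        return (lpae_t *)VIRT_LIN_P2M_START + (addr >> PAGE_SHIFT);
    }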
Jaeyong Yoo
2013-Nov-15 02:26 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
> -----Original Message----- > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > Sent: Wednesday, November 13, 2013 1:56 AM > To: Jaeyong Yoo > Cc: xen-devel@lists.xen.org > Subject: Re: [Xen-devel] [PATCH RESEND v5 5/6] xen/arm: Implement > hypercall for dirty page tracing > > On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: > > Add hypercall (shadow op: enable/disable and clean/peek dirtied page > bitmap). > > It consists of two parts: dirty page detecting and saving. > > For detecting, we setup the guest p2m''s leaf PTE read-only and > > whenever the guest tries to write something, permission fault happens > and traps into xen. > > The permission-faulted GPA should be saved for the toolstack (when it > > wants to see which pages are dirtied). For this purpose, we temporarily > save the GPAs into bitmap. > > > > Changes from v4: > > 1. For temporary saving dirty pages, use bitmap rather than linked list. > > 2. Setup the p2m''s second level page as read-write in the view of xen''s > memory access. > > It happens in p2m_create_table function. > > > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > > --- > > xen/arch/arm/domain.c | 14 +++ > > xen/arch/arm/domctl.c | 9 ++ > > xen/arch/arm/mm.c | 103 +++++++++++++++++++- > > xen/arch/arm/p2m.c | 206 > ++++++++++++++++++++++++++++++++++++++++ > > xen/arch/arm/traps.c | 9 ++ > > xen/include/asm-arm/domain.h | 7 ++ > > xen/include/asm-arm/mm.h | 7 ++ > > xen/include/asm-arm/p2m.h | 4 + > > xen/include/asm-arm/processor.h | 2 + > > 9 files changed, 360 insertions(+), 1 deletion(-) > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > c0b5dd8..0a32301 100644 > > --- a/xen/arch/arm/domain.c > > +++ b/xen/arch/arm/domain.c > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > WRITE_SYSREG(hcr, HCR_EL2); > > isb(); > > > > + /* for dirty-page tracing > > + * XXX: how do we consider SMP case? > > + */ > > + if ( n->domain->arch.dirty.mode ) > > + restore_vlpt(n->domain); > > This is an interesting question. xen_second is shared between all pcpus, > which means that the vlpt is currently only usable from a single physical > CPU at a time. > > Currently the only per-cpu area is the domheap region from 2G-4G. We could > steal some address space from the top or bottom of there?oh right hmm. Then, how about place the vlpt area starting from 2G (between dom heap and xen heap), and let the map_domain_page map the domain page starting from the VLPT-end address? For this, I think just changing DOMHEAP_VIRT_START to some place (maybe 0x88000000) would be suffice.> > > /* This is could trigger an hardware interrupt from the virtual > > * timer. The interrupt needs to be injected into the guest. 
*/ > > virt_timer_restore(n); > > @@ -509,11 +515,19 @@ int arch_domain_create(struct domain *d, unsigned > int domcr_flags) > > /* Default the virtual ID to match the physical */ > > d->arch.vpidr = boot_cpu_data.midr.bits; > > > > + /* init for dirty-page tracing */ > > + d->arch.dirty.count = 0; > > + d->arch.dirty.mode = 0; > > + spin_lock_init(&d->arch.dirty.lock); > > + > > d->arch.dirty.second_lvl_start = 0; > > d->arch.dirty.second_lvl_end = 0; > > d->arch.dirty.second_lvl[0] = NULL; > > d->arch.dirty.second_lvl[1] = NULL; > > > > + memset(d->arch.dirty.bitmap, 0, sizeof(d->arch.dirty.bitmap)); > > + d->arch.dirty.bitmap_pages = 0; > > + > > clear_page(d->shared_info); > > share_xen_page_with_guest( > > virt_to_page(d->shared_info), d, XENSHARE_writable); diff > > --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index > > cb38e59..eb74225 100644 > > --- a/xen/arch/arm/domctl.c > > +++ b/xen/arch/arm/domctl.c > > @@ -93,6 +93,15 @@ long arch_do_domctl(struct xen_domctl *domctl, struct > domain *d, > > xfree(c.data); > > } > > break; > > + case XEN_DOMCTL_shadow_op: > > + { > > + domain_pause(d); > > + ret = dirty_mode_op(d, &domctl->u.shadow_op); > > + domain_unpause(d); > > + > > + copyback = 1; > > + } > > + break; > > > > default: > > return -EINVAL; > > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index > > bf13993..d5a0a11 100644 > > --- a/xen/arch/arm/mm.c > > +++ b/xen/arch/arm/mm.c > > @@ -845,7 +845,6 @@ void destroy_xen_mappings(unsigned long v, unsigned > long e) > > create_xen_entries(REMOVE, v, 0, (e - v) >> PAGE_SHIFT, 0); } > > > > -enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; static void > > set_pte_flags_on_range(const char *p, unsigned long l, enum mg mg) { > > lpae_t pte; > > @@ -1320,6 +1319,60 @@ int is_iomem_page(unsigned long mfn) > > * xen: arm: 64-bit guest support and domU FDT autogeneration > > * will be upstreamed. > > */ > > + > > +static inline void mark_dirty_bitmap(struct domain *d, paddr_t addr) > > +{ > > + paddr_t ram_base = (paddr_t) GUEST_RAM_BASE; > > + int bit_index = PFN_DOWN(addr - ram_base); > > + int page_index = bit_index >> (PAGE_SHIFT + 3); > > +3?+3 is for changing the bit-index to byte-index. I think it would be better to use BYTE_SHIFT kind of thing.> > > + int bit_index_residual = bit_index & ((1ul << (PAGE_SHIFT + 3)) - > > + 1); > > + > > + set_bit(bit_index_residual, d->arch.dirty.bitmap[page_index]); > > +} > > + > > +/* routine for dirty-page tracing > > + * > > + * On first write, it page faults, its entry is changed to > > +read-write, > > + * and on retry the write succeeds. > > + * > > + * for locating p2m of the faulting entry, we use virtual-linear page > table. > > + * returns zero if addr is not valid or dirty mode is not set */ int > > +handle_page_fault(struct domain *d, paddr_t addr) { > > + > > + lpae_t *vlp2m_pte = 0; > > + paddr_t gma_start = 0; > > + paddr_t gma_end = 0; > > + > > + if ( !d->arch.dirty.mode ) return 0; > > + get_gma_start_end(d, &gma_start, &gma_end); > > + > > + /* Ensure that addr is inside guest''s RAM */ > > + if ( addr < gma_start || > > + addr > gma_end ) return 0; > > + > > + vlp2m_pte = get_vlpt_3lvl_pte(addr); > > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 && > > + vlp2m_pte->p2m.avail == 0 /* reuse avail bit as read-only */ ) > > + { > > + lpae_t pte = *vlp2m_pte; > > + pte.p2m.write = 1; > > + write_pte(vlp2m_pte, pte); > > Do we not need the p2m lock around here somewhere?Oh, I think this brings an another question. 
In the dirty-page tracing phase, would it be OK to permit changes in p2m page table (such as adding new pages)? In that case, VLPT and dirty-bitmap should be re-prepared for taking account this change.> > > + flush_tlb_local(); > > If SMP was working this would need to be inner shareable to cope with > vcpus running on other pcpus.You meant the p2m page table should be inner shareable?> > > + /* only necessary to lock between get-dirty bitmap and mark dirty > > + * bitmap. If get-dirty bitmap happens immediately before this > > + * lock, the corresponding dirty-page would be marked at the next > > + * round of get-dirty bitmap */ > > + spin_lock(&d->arch.dirty.lock); > > + mark_dirty_bitmap(d, addr); > > + spin_unlock(&d->arch.dirty.lock); > > Might an atomic set_bit suffice rather than the lock? > > > + } > > + > > + return 1; > > +} > > + > > void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t > > *end) { > > if ( start ) > > @@ -1440,6 +1493,54 @@ void cleanup_vlpt(struct domain *d) > > unmap_domain_page_global(d->arch.dirty.second_lvl[0]); > > unmap_domain_page_global(d->arch.dirty.second_lvl[1]); > > } > > + > > +int prepare_bitmap(struct domain *d) > > +{ > > + paddr_t gma_start = 0; > > + paddr_t gma_end = 0; > > + int nr_bytes; > > + int nr_pages; > > + int i; > > + > > + get_gma_start_end(d, &gma_start, &gma_end); > > + > > + nr_bytes = (PFN_DOWN(gma_end - gma_start) + 7) / 8; > > + nr_pages = (nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE; > > + > > + BUG_ON( nr_pages > MAX_DIRTY_BITMAP_PAGES ); > > + > > + for ( i = 0; i < nr_pages; ++i ) > > + { > > + struct page_info *page; > > + page = alloc_domheap_page(NULL, 0); > > + if ( page == NULL ) > > + goto cleanup_on_failure; > > + > > + d->arch.dirty.bitmap[i] > > + map_domain_page_global(__page_to_mfn(page)); > > I think you may as well just use the xenheap here.OK.> > > +/* Change types across all p2m entries in a domain */ void > > +p2m_change_entry_type_global(struct domain *d, enum mg nt) { > > + struct p2m_domain *p2m = &d->arch.p2m; > > + paddr_t ram_base; > > + int i1, i2, i3; > > + int first_index, second_index, third_index; > > + lpae_t *first = __map_domain_page(p2m->first_level); > > + lpae_t pte, *second = NULL, *third = NULL; > > + > > + get_gma_start_end(d, &ram_base, NULL); > > + > > + first_index = first_table_offset((uint64_t)ram_base); > > + second_index = second_table_offset((uint64_t)ram_base); > > + third_index = third_table_offset((uint64_t)ram_base); > > Are those casts needed?Oh, they do not need to be.> > > + > > + BUG_ON( !first && "Can''t map first level p2m." ); > > + > > + spin_lock(&p2m->lock); > > + > > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) > > You avoid iterating over stuff below ram_base but you aren''t worried about > ram end?I think I need to rework this function for accounting all those comments below.> > > + { > > + lpae_walk_t first_pte = first[i1].walk; > > + if ( !first_pte.valid || !first_pte.table ) > > + goto out; > > + > > + second = map_domain_page(first_pte.base); > > + BUG_ON( !second && "Can''t map second level p2m."); > > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > > + { > > + lpae_walk_t second_pte = second[i2].walk; > > + > > + if ( !second_pte.valid || !second_pte.table ) > > + goto out; > > goto out is a bit strong. Don''t you want to move on to the next second > level page? 
> > > + > > + third = map_domain_page(second_pte.base); > > + BUG_ON( !third && "Can''t map third level p2m."); > > + > > + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) > > + { > > + lpae_walk_t third_pte = third[i3].walk; > > + if ( !third_pte.valid ) > > + goto out; > > + > > + pte = third[i3]; > > + if ( nt == mg_ro ) > > + { > > + if ( pte.p2m.write == 1 ) > > + { > > + pte.p2m.write = 0; > > + pte.p2m.avail = 0; > > + } > > + else > > + { > > + /* reuse avail bit as an indicator > > + * of ''actual'' read-only */ > > + pte.p2m.avail = 1; > > + } > > This inner if is equivalent to > pte.p2m.avail = pte.p2m.write; > pte.p2m.write = 1; > > > + } > > + else if ( nt == mg_rw ) > > + { > > + if ( pte.p2m.write == 0 && pte.p2m.avail == 0 ) > > + { > > + pte.p2m.write = 1; > > + } > > and this one is: > pte.p2m.write = pte.p2m.avail; > > > + } > > + write_pte(&third[i3], pte); > > + } > > + unmap_domain_page(third); > > + > > + third = NULL; > > + third_index = 0; > > + } > > + unmap_domain_page(second); > > + > > + second = NULL; > > + second_index = 0; > > + third_index = 0; > > + } > > + > > +out: > > + flush_tlb_all_local(); > > + if ( third ) unmap_domain_page(third); > > + if ( second ) unmap_domain_page(second); > > + if ( first ) unmap_domain_page(first); > > + > > + spin_unlock(&p2m->lock); > > +} > > + > > +/* Read a domain''s log-dirty bitmap and stats. > > + * If the operation is a CLEAN, clear the bitmap and stats. */ int > > +log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) { > > + int peek = 1; > > + int i; > > + int bitmap_size; > > + paddr_t gma_start, gma_end; > > + > > + /* this hypercall is called from domain 0, and we don''t know which > guest''s > > + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt > here */ > > + restore_vlpt(d); > > + > > + get_gma_start_end(d, &gma_start, &gma_end); > > + bitmap_size = (gma_end - gma_start) / 8; > > + > > + if ( guest_handle_is_null(sc->dirty_bitmap) ) > > + { > > + peek = 0; > > What do you do with this?peek was used for just seeing which pages are dirted without resetting the dirty bitmap. It could be used for analyzing phase of dirty-pages (just to see the dirty page generation rates) at toolstack. But, we are having second thought on this and I think the code cleanup of this analysis phase was not perfect. So, the answer is nothing. I will clean up this.> > > + } > > + else > > + { > > + spin_lock(&d->arch.dirty.lock); > > + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) > > + { > > + int j = 0; > > + uint8_t *bitmap; > > + copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE, > > + d->arch.dirty.bitmap[i], > > + bitmap_size < PAGE_SIZE ? 
bitmap_size : > > + > > + PAGE_SIZE); > > The bitmap is always a mutliple of page_size, isn''t it?But the actually used bits inside of the bitmap are not multiple of page_size I think.> > > + bitmap_size -= PAGE_SIZE; > > + > > + /* set p2m page table read-only */ > > + bitmap = d->arch.dirty.bitmap[i]; > > + while ((j = find_next_bit((const long unsigned int *)bitmap, > > + PAGE_SIZE*8, j)) < PAGE_SIZE*8) > > + { > > + lpae_t *vlpt; > > + paddr_t addr = gma_start + > > + (i << (2*PAGE_SHIFT+3)) + > > + (j << PAGE_SHIFT); > > + vlpt = get_vlpt_3lvl_pte(addr); > > + vlpt->p2m.write = 0; > > Needs to be a write_pte type construct I think.Oh sure.> > > + j++; > > + } > > + } > > + > > + if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN ) > > + { > > + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) > > + { > > + clear_page(d->arch.dirty.bitmap[i]); > > + } > > + } > > + > > + spin_unlock(&d->arch.dirty.lock); > > Ah, I see why the lock in the handler is needed. > > This holds the lock over quite a long period. Could it be made more > granular by copying and clearing each page in sequence, dropping the lock > between pages?OK, I see.> > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index > > 287dd7b..1a7ed11 100644 > > --- a/xen/arch/arm/traps.c > > +++ b/xen/arch/arm/traps.c > > @@ -1321,6 +1321,8 @@ static void do_trap_data_abort_guest(struct > cpu_user_regs *regs, > > const char *msg; > > int rc, level = -1; > > mmio_info_t info; > > + int page_fault = ( (dabt.dfsc & FSC_MASK) => > + (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write > > + ); > > I think you can use FSC_TYPE_MASK and FSC_LL_MASK here, can''t you? > > I think this would be better off refactored into a > dabt_is_page_fault(dabt), used in the test below. That would allow you to > more easily comment on why these particular conditions are the ones we > care about etc. 
> > > if ( !check_conditional_instr(regs, hsr) ) > > { > > @@ -1342,6 +1344,13 @@ static void do_trap_data_abort_guest(struct > cpu_user_regs *regs, > > if ( rc == -EFAULT ) > > goto bad_data_abort; > > > > + /* domU page fault handling for guest live migration */ > > + /* dabt.valid can be 0 here */ > > + if ( page_fault && handle_page_fault(current->domain, info.gpa) ) > > + { > > + /* Do not modify pc after page fault to repeat memory operation > */ > > + return; > > + } > > /* XXX: Decode the instruction if ISS is not valid */ > > if ( !dabt.valid ) > > goto bad_data_abort; > > diff --git a/xen/include/asm-arm/domain.h > > b/xen/include/asm-arm/domain.h index 4f366f1..180d924 100644 > > --- a/xen/include/asm-arm/domain.h > > +++ b/xen/include/asm-arm/domain.h > > @@ -114,9 +114,16 @@ struct arch_domain > > > > /* dirty-page tracing */ > > struct { > > +#define MAX_DIRTY_BITMAP_PAGES 64 /* support upto 8GB guest memory > */ > > + spinlock_t lock; /* protect list: head, mvn_head */ > > + volatile int mode; /* 1 if dirty pages tracing enabled > */ > > + volatile unsigned int count; /* dirty pages counter */ > > More unnecessary volatiles i think.OK.> > > volatile int second_lvl_start; /* for context switch */ > > volatile int second_lvl_end; > > lpae_t *second_lvl[2]; /* copy of guest p2m''s first */ > > + /* dirty bitmap */ > > + uint8_t *bitmap[MAX_DIRTY_BITMAP_PAGES]; > > + int bitmap_pages; /* number of dirty bitmap pages */ > > } dirty; > > > > } __cacheline_aligned; > > diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h index > > a74e135..1ce7a4b 100644 > > --- a/xen/include/asm-arm/mm.h > > +++ b/xen/include/asm-arm/mm.h > > @@ -341,11 +341,18 @@ static inline void put_page_and_type(struct > page_info *page) > > put_page(page); > > } > > > > +enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; > > + > > +/* routine for dirty-page tracing */ > > +int handle_page_fault(struct domain *d, paddr_t addr); > > void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t > > *end); int prepare_vlpt(struct domain *d); void cleanup_vlpt(struct > > domain *d); void restore_vlpt(struct domain *d); > > > > +int prepare_bitmap(struct domain *d); void cleanup_bitmap(struct > > +domain *d); > > + > > /* calculate the xen''s virtual address for accessing the leaf PTE of > > * a given address (GPA) */ > > static inline lpae_t * get_vlpt_3lvl_pte(paddr_t addr) diff --git > > a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h index > > c660820..dba9a7b 100644 > > --- a/xen/include/asm-arm/p2m.h > > +++ b/xen/include/asm-arm/p2m.h > > @@ -2,6 +2,7 @@ > > #define _XEN_P2M_H > > > > #include <xen/mm.h> > > +#include <public/domctl.h> > > > > struct domain; > > > > @@ -110,6 +111,9 @@ static inline int get_page_and_type(struct page_info > *page, > > return rc; > > } > > > > +void p2m_change_entry_type_global(struct domain *d, enum mg nt); long > > +dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc); > > + > > #endif /* _XEN_P2M_H */ > > > > /* > > diff --git a/xen/include/asm-arm/processor.h > > b/xen/include/asm-arm/processor.h index 5294421..fced6ad 100644 > > --- a/xen/include/asm-arm/processor.h > > +++ b/xen/include/asm-arm/processor.h > > @@ -399,6 +399,8 @@ union hsr { > > #define FSC_CPR (0x3a) /* Coprocossor Abort */ > > > > #define FSC_LL_MASK (0x03<<0) > > +#define FSC_MASK (0x3f) /* Fault status mask */ > > +#define FSC_3D_LEVEL (0x03) /* Third level fault*/ > > > > /* Time counter hypervisor control register */ > > #define CNTHCTL_PA (1u<<0) /* Kernel/user access to 
physical > counter */
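As a reference for the "+3" byte-index conversion discussed in the reply above, a minimal sketch of the same index arithmetic with the shift named; BITS_PER_BYTE_SHIFT and BITS_PER_BITMAP_PAGE_SHIFT are illustrative names, not existing Xen constants:

    #define BITS_PER_BYTE_SHIFT        3
    /* each bitmap page tracks PAGE_SIZE * 8 guest pages */
    #define BITS_PER_BITMAP_PAGE_SHIFT (PAGE_SHIFT + BITS_PER_BYTE_SHIFT)

    static inline void mark_dirty_bitmap(struct domain *d, paddr_t addr)
    {
        paddr_t ram_base = (paddr_t) GUEST_RAM_BASE;
        int bit_index  = PFN_DOWN(addr - ram_base);
        /* which page of the bitmap holds this bit */
        int page_index = bit_index >> BITS_PER_BITMAP_PAGE_SHIFT;
        /* bit offset within that bitmap page */
        int bit_index_residual =
            bit_index & ((1ul << BITS_PER_BITMAP_PAGE_SHIFT) - 1);

        set_bit(bit_index_residual, d->arch.dirty.bitmap[page_index]);
    }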
Jaeyong Yoo
2013-Nov-15 04:15 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
I accidentally missed this comment in the previous email.> > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index > > 287dd7b..1a7ed11 100644 > > --- a/xen/arch/arm/traps.c > > +++ b/xen/arch/arm/traps.c > > @@ -1321,6 +1321,8 @@ static void do_trap_data_abort_guest(struct > cpu_user_regs *regs, > > const char *msg; > > int rc, level = -1; > > mmio_info_t info; > > + int page_fault = ( (dabt.dfsc & FSC_MASK) => > + (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write > > + ); > > I think you can use FSC_TYPE_MASK and FSC_LL_MASK here, can't you? >Something like dabt.dfsc & FSC_TYPE_MASK == FSC_TYPE_FAULT && dabt.dfsc & FSC_LL_MASK == 0x3 /* third level */, right?> I think this would be better off refactored into a > dabt_is_page_fault(dabt), used in the test below. That would allow you to > more easily comment on why these particular conditions are the ones we > care about etc. >Sure. Jaeyong
Eugene Fedotov
2013-Nov-15 07:04 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
13.11.2013 14:58, Ian Campbell wrote:> On Wed, 2013-11-13 at 12:28 +0400, Eugene Fedotov wrote: >>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c >>>> index 123280e..3801f07 100644 >>>> --- a/xen/arch/arm/mm.c >>>> +++ b/xen/arch/arm/mm.c >>>> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type) >>>> >>>> unsigned long domain_get_maximum_gpfn(struct domain *d) >>> s/unsigned long/xen_pfn_t/ I think. >>> >>> Urk, which will break the ABI for this hypercall. That's something of a >>> conundrum :-/ >>> >>> It is a domctl so we are at liberty to change it, probably to taking a >>> xen_pfn_t * to fill in instead of returning the value. >>> >>> I'm in two minds about whether we should do so now or postpone it (which >>> risks forgetting and having an obscure bug somewhere down the line). >> OK, this way, or other solution is to purge this hypercall yet in >> migration and use max_memkb field in DOMCTL_getdomaininfo to obtain the >> maximum PFN. > Is there an interface to retrieve that? I don't see it :-(No, it is another hypercall XEN_DOMCTL_getdomaininfo. max_memkb field in xc_dominfo_t structure is a second way to retrieve maximum memory size and calculate PFN. Do we need domain_get_maximum_gpfn ? Best regards, Evgeny.
Jaeyong Yoo
2013-Nov-18 03:47 UTC
Re: [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
> > > + > > > + second_lvl_page = alloc_domheap_pages(NULL, 1, 0); > > > > There''d be no harm in allocating the two pages separately I think, and > > avoiding the possiblilty of an order 1 allocation failing. > > > > But if you do allocate them contiguously then d->arch.dirty.second_lvl > > can be a simple lpae_t * and you can just access offsets 0..1023 > > without worrying about managing the array of two pointers. > > > > As it stands I think you have the worst of both worlds. > > > > We could also just consider allocating all firstlevel p2m pages from > > the xenheap instead of the domheap. > > OK. xenheap allocation sounds good. I was not sure if I can use xenheap > for the one that looks like ''domain-specific'' purpose. If it''s OK, it > would be much better code readability. > > > > > > + if ( second_lvl_page == NULL ) > > > + return -ENOMEM; > > > + > > > + /* First level p2m is 2 consecutive pages */ > > > + d->arch.dirty.second_lvl[0] = map_domain_page_global( > > > +page_to_mfn(second_lvl_page) );> > > + d->arch.dirty.second_lvl[1] = map_domain_page_global( > > > + > > > + page_to_mfn(second_lvl_page+1) ); > > > + > > > + first[0] = __map_domain_page(p2m->first_level); > > > + first[1] = __map_domain_page(p2m->first_level+1); > > > + for ( i = gp2m_start_index; i < gp2m_end_index; ++i ) > > > > Can''t this just be a loop over 0..1023 and avoid all this modular > > arithmetic: > > Sure.CTTOI, since p2m->first_level consists of 2 pages, we can''t just loop over 0..1023, unless we have an improved map_domain_page that can map consecutive pages. One way to avoid the modular arithmetic is to use nested loop; loop over first_level and then loop inside a first_level page. I think this one (nested loop) is better than modular arithmetic. Jaeyong
Jaeyong Yoo
2013-Nov-19 01:32 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
> > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > > c0b5dd8..0a32301 100644 > > > --- a/xen/arch/arm/domain.c > > > +++ b/xen/arch/arm/domain.c > > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > > WRITE_SYSREG(hcr, HCR_EL2); > > > isb(); > > > > > > + /* for dirty-page tracing > > > + * XXX: how do we consider SMP case? > > > + */ > > > + if ( n->domain->arch.dirty.mode ) > > > + restore_vlpt(n->domain); > > > > This is an interesting question. xen_second is shared between all > > pcpus, which means that the vlpt is currently only usable from a > > single physical CPU at a time. > > > > Currently the only per-cpu area is the domheap region from 2G-4G. We > > could steal some address space from the top or bottom of there? > > oh right hmm. Then, how about place the vlpt area starting from 2G > (between dom heap and xen heap), and let the map_domain_page map the > domain page starting from the VLPT-end address? > > For this, I think just changing DOMHEAP_VIRT_START to some place (maybe > 0x88000000) would be suffice. >I just found out that DOMHEAP_VIRT_START should be aligned to first-level size. So, 0x88000000 wouldn''t work. So, I''m thinking to use the back of the domheap area and piggybacking the VLPT into the dom-heap region. For this, we should use something like the following layout: #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) #define VIRT_LIN_P2M_START _AT(vaddr_t,0xf8000000) #define VIRT_LIN_P2M_END _AT(vaddr_t<0xffffffff) #define DOMHEAP_VIRT_END _AT(vaddr_t,0xffffffff) where VLPT area is overlapped to domheap. This is necessary since dommap size is decided at booting time by the calculation of DOMHEAP_VIRT_END - DOMHEAP_VIRT_START, and at this time we have to also allocate the dommap for VLPT. Something else we can do is to give VLPT for 1G memory and giving it its own per-cpu page table. But, using 1G for VLPT looks like some waste of virtual address. So, I''m thinking of using overlapped VLPT. Could you tell me your opinion about this? Jaeyong
Eugene Fedotov
2013-Nov-19 11:06 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
13.11.2013 15:09, Ian Campbell wrote:> On Wed, 2013-11-13 at 13:57 +0400, Eugene Fedotov wrote: >>>> diff --git a/tools/libxc/xc_arm_migrate.c b/tools/libxc/xc_arm_migrate.c >>>> new file mode 100644 >>>> index 0000000..461e339 >>>> --- /dev/null >>>> +++ b/tools/libxc/xc_arm_migrate.c >>>> @@ -0,0 +1,712 @@ >>> Is this implementing the exact protocol as described in >>> tools/libxc/xg_save_restore.h or is it a variant? Are there any docs of >>> the specifics of the ARM protocol? >> This implements a quite different from tools/libxc/xg_save_restore.h >> protocol, it is much more simplified because we do not need some things >> that implemented for x86. So you''re right, it has to be documented. >> Should we use a different header to place documentation to this (and >> place some definitions), for example xc_arm_migrate.h? > I think xg_arm_save_restore.h would be the consistent name. At some > point we should rename some of the x86 specific stuff, but you don''t > need to worry about that.OK>>> We will eventually need to make a statement about the stability of the >>> protocol, i.e on x86 we support X->X+1 migrations across Xen versions. I >>> think we''d need to make similar guarantees on ARM before we would remove >>> the "tech preview" label from the migration feature. >> So, should you believe our results (and where should we place this >> statement) or should you make tests from your side? > By stability I mean the stability of the migration ABI across version > changes, not stability as in the bugginess or otherwise of the code. IOW > a commitment to supporting migration from version X to version X+1 in > the future. > > WRT "tech preview" I don''t think we are going to be able to honestly > call this production ready for the general case in 4.4, I expect it''ll > need a bit longer to "bed in" before we are happy with that statement. > (this has no bearing on whether you want to fully support it as > production ready in your particular application).Our migration protocol consists of: header (Console and XenStore PFN), memory data, HVM context. Can we guarantee that HVM context of version X is compatible (or the same) as HVM context of version X ? I think if we add some HVM hardware (such as virtual RTC) in version X+1 we may have a compatibility problem, because HVM buffer structure had been changed. I think we cannot predict such things in future versions. 
But our migration protocol may have version number and migration between X and X+1 Xen versions is possible when we use the same protocol number in both sides (we may compare protocol versions during the migration runtime).>>>> +/* ============ Memory ============= */ >>>> +static int save_memory(xc_interface *xch, int io_fd, uint32_t dom, >>>> + struct save_callbacks *callbacks, >>>> + uint32_t max_iters, uint32_t max_factor, >>>> + guest_params_t *params) >>>> +{ >>>> + int live = !!(params->flags & XCFLAGS_LIVE); >>>> + int debug = !!(params->flags & XCFLAGS_DEBUG); >>>> + xen_pfn_t i; >>>> + char reportbuf[80]; >>>> + int iter = 0; >>>> + int last_iter = !live; >>>> + int total_dirty_pages_num = 0; >>>> + int dirty_pages_on_prev_iter_num = 0; >>>> + int count = 0; >>>> + char *page = 0; >>>> + xen_pfn_t *busy_pages = 0; >>>> + int busy_pages_count = 0; >>>> + int busy_pages_max = 256; >>>> + >>>> + DECLARE_HYPERCALL_BUFFER(unsigned long, to_send); >>>> + >>>> + xen_pfn_t start = params->start_gpfn; >>>> + const xen_pfn_t end = params->max_gpfn; >>>> + const xen_pfn_t mem_size = end - start; >>>> + >>>> + if ( debug ) >>>> + { >>>> + IPRINTF("(save mem) start=%llx end=%llx!\n", start, end); >>>> + } >>> FYI you don''t need the {}''s for cases like this. >> Actually we don''t need {}, this has been done because we was not sure if >> this macro can be empty-substituted. >>> is if ( debug ) IPRINTF(...) not the equivalent of DPRINTF? >> This equivalence is not obvious for me, because in current code we >> obtain debug flag with XCFLAGS_DEBUG mask (when --debug option passed). >> If it is equivalent I''ll use DPRINTF. > No you are right, these are different. Actually DPRINTF is "detail > level", there is a DBGPRINTF which is "debug level" (levels here being > in terms of XTL_FOO from xentoollog.h). So you probably do want to use > if (debug), although you may separately want to consider which level the > resulting messages are printed at when they are printed...OK>>> [...] >>>> + >>>> +static int restore_guest_params(xc_interface *xch, int io_fd, >>>> + uint32_t dom, guest_params_t *params) >>>> +{ >>>> [...] >>>> + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) >>>> + { >>>> + ERROR("Can''t set memory map"); >>>> + return -1; >>>> + } >>>> + >>>> + /* Set max. number of vcpus as max_vcpu_id + 1 */ >>>> + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) ) >>> Does the higher level toolstack not take care of vcpus and maxmem? I >>> thought so. I think this is how it shoud be. >> For my tests guest config information is not transferred for ARM case >> from high-level stack. At the migration receiver side toolstack always >> create a new domain with vcpus=1 and default max. mem. So we have to >> send guest information as our local guest_params structure (at the >> beginning of migration). >> It is easy way to work "xl save" or "xl migrate" without modification of >> libxl level, but you may have another idea? >> Also, toolstack_restore callback is not set (NULL) for ARM case. > So you aren''t using xl to do the migration? This is what we should > ultimately be aiming for. It is almost certainly going to require fixes > at the libxl level though. > > If you need to send additional information for your usecase then it > should be done at the layer above libxc.I found the reason why guest config information is not saved in libxl. 
the following line in xl_cmdimpl.c: if (!dom_info->quiet) printf("Parsing config from %s\n", config_source); in function "create_domain" calls printf instead of fprintf(stderr, ...). It is not the same on ARM. I have a feeling that printf() causes the output to go to the socket buffer and breaks the config data. Should we use fprintf(stderr, "Parsing config from %s\n", config_source); here? If we do this, the config is transferred and we do not need to set the VCPU number and memory manually. Best regards, Evgeny.
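A rough sketch of the versioned stream header idea described earlier in this mail; field names beyond those already used by the series (start_gpfn, max_gpfn, flags, max_vcpu_id) and the version constant are made up for illustration only:

    /* illustrative only: one possible shape for the migration stream header */
    #define XC_ARM_MIGRATE_VERSION 1   /* bumped whenever the stream layout changes */

    typedef struct guest_params {
        uint32_t version;       /* compared by the receiver before restoring */
        uint64_t console_pfn;
        uint64_t xenstore_pfn;
        uint64_t start_gpfn;    /* GUEST_RAM_BASE >> PAGE_SHIFT */
        uint64_t max_gpfn;
        uint32_t flags;         /* XCFLAGS_* */
        uint32_t max_vcpu_id;
    } guest_params_t;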
Ian Campbell
2013-Nov-19 11:37 UTC
Re: [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
> > > +{ > > > + int i; > > > + dsb(); > > > + for ( i = d->arch.dirty.second_lvl_start; > > > + i < d->arch.dirty.second_lvl_end; > > > + ++i ) > > > + { > > > + int k = i % LPAE_ENTRIES; > > > + int l = i / LPAE_ENTRIES; > > > + > > > + if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits ) > > > + { > > > + __write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]); > > > + __flush_xen_data_tlb_range_va(i << SECOND_SHIFT, > > > + 1 << SECOND_SHIFT); > > > + } > > > + } > > > + dsb(); > > > + isb(); > > > +} > > > + > > > +/* setting up the xen page table for vlpt mapping for domain d */ int > > > +prepare_vlpt(struct domain *d) { > > > + int xen_second_linear_base; > > > + int gp2m_start_index, gp2m_end_index; > > > + struct p2m_domain *p2m = &d->arch.p2m; > > > + struct page_info *second_lvl_page; > > > + paddr_t gma_start = 0; > > > + paddr_t gma_end = 0; > > > + lpae_t *first[2]; > > > + int i; > > > + uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START; > > > + > > > + get_gma_start_end(d, &gma_start, &gma_end); > > > + required = (gma_end - gma_start) >> LPAE_SHIFT; > > > + > > > + > > > + if ( required > avail ) > > > > avail is the number of bytes of virtual address space in the linear p2m > > area. > > > > required is the number of pages which the guest has. Does this comparison > > correctly account for each page being represented by an 8-byte lpae_t > > entry? > > Yes, it does. See, LPAE_SHIFT only shifts 9 bits. Shifting PAGE_SHIFT (12 bits) > gives the number of pages (every bit), and 9-bit shift gives the required > memory for storing third-level PTE. Since this one-shifting of LPAE_SHIFT > is confusing enough, we can make it explicit by using something like > required = ((gma_end - gma_start) >> PAGE_SHIFT ) * (sizeof lpae_t).I think this is clearer than requiring the reader to have to think about the effect of shift by N-M and the reasons for it, thanks. The alternative would be a comment explaining what is going on. Ian.
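For reference, the clearer form agreed here might read as follows (a sketch; the error handling shown is assumed, not taken from the patch):

    uint64_t avail    = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START;
    /* one lpae_t of VLPT address space per guest page */
    uint64_t required = ((gma_end - gma_start) >> PAGE_SHIFT) * sizeof(lpae_t);

    if ( required > avail )
        return -ENOMEM;   /* not enough linear map space */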
Ian Campbell
2013-Nov-19 11:38 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
On Fri, 2013-11-15 at 13:15 +0900, Jaeyong Yoo wrote:> I accidently missed this comment in the previous email. > > > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index > > > 287dd7b..1a7ed11 100644 > > > --- a/xen/arch/arm/traps.c > > > +++ b/xen/arch/arm/traps.c > > > @@ -1321,6 +1321,8 @@ static void do_trap_data_abort_guest(struct > > cpu_user_regs *regs, > > > const char *msg; > > > int rc, level = -1; > > > mmio_info_t info; > > > + int page_fault = ( (dabt.dfsc & FSC_MASK) => > > + (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write > > > + ); > > > > I think you can use FSC_TYPE_MASK and FSC_LL_MASK here, can''t you? > > > > Something like > > dabt.dfsc & FSC_TYPE_MASK == FSC_TYPE_FAULT && > dabt.dfsc & FSC_LL_MASK == 0x3 /* third level */, > > right?Something like that, yes. You''d need the bit about it being a perm fault in there somewhere too.> > > I think this would be better off refactored into a > > dabt_is_page_fault(dabt), used in the test below. That would allow you to > > more easily comment on why these particular conditions are the ones we > > care about etc. > > > > Sure. > > Jaeyong >
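Putting the two replies together, a sketch of the helper might look like this (assuming FSC_TYPE_MASK masks out the level bits of the DFSC; the exact constant names in the tree may differ):

    static inline int dabt_is_page_fault(struct hsr_dabt dabt)
    {
        /* only a write causing a third-level permission fault is a
         * candidate dirty-page fault for live migration */
        return dabt.write &&
               (dabt.dfsc & FSC_TYPE_MASK) == FSC_FLT_PERM &&
               (dabt.dfsc & FSC_LL_MASK) == FSC_3D_LEVEL;
    }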
Ian Campbell
2013-Nov-19 11:42 UTC
Re: [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration
On Mon, 2013-11-18 at 12:47 +0900, Jaeyong Yoo wrote:> > > > + > > > > + second_lvl_page = alloc_domheap_pages(NULL, 1, 0); > > > > > > There''d be no harm in allocating the two pages separately I think, and > > > avoiding the possiblilty of an order 1 allocation failing. > > > > > > But if you do allocate them contiguously then d->arch.dirty.second_lvl > > > can be a simple lpae_t * and you can just access offsets 0..1023 > > > without worrying about managing the array of two pointers. > > > > > > As it stands I think you have the worst of both worlds. > > > > > > We could also just consider allocating all firstlevel p2m pages from > > > the xenheap instead of the domheap. > > > > OK. xenheap allocation sounds good. I was not sure if I can use xenheap > > for the one that looks like ''domain-specific'' purpose. If it''s OK, it > > would be much better code readability. > > > > > > > > > + if ( second_lvl_page == NULL ) > > > > + return -ENOMEM; > > > > + > > > > + /* First level p2m is 2 consecutive pages */ > > > > + d->arch.dirty.second_lvl[0] = map_domain_page_global( > > > > + > page_to_mfn(second_lvl_page) ); > > > > + d->arch.dirty.second_lvl[1] = map_domain_page_global( > > > > + > > > > + page_to_mfn(second_lvl_page+1) ); > > > > + > > > > + first[0] = __map_domain_page(p2m->first_level); > > > > + first[1] = __map_domain_page(p2m->first_level+1); > > > > + for ( i = gp2m_start_index; i < gp2m_end_index; ++i ) > > > > > > Can''t this just be a loop over 0..1023 and avoid all this modular > > > arithmetic: > > > > Sure. > > CTTOI, since p2m->first_level consists of 2 pages, we can''t just loop over > 0..1023, > unless we have an improved map_domain_page that can map consecutive pages. > One way to avoid the modular arithmetic is to use nested loop; loop over > first_level > and then loop inside a first_level page. I think this one (nested loop) is > better than > modular arithmetic.Or if it were a xenheap allocation it would allow consecutive pages. If the modular arithmetic is needed then that''s fine, or if you prefer the nested loop then that is fine too. A helper function to get an entry from a specified offset in the first table which encapsulated the modular arithmetic would likely make that approach more palatable. Ian.
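A sketch of the kind of helper suggested here, assuming first[0] and first[1] are the two mapped pages of the first-level table; the name first_table_entry is illustrative:

    static lpae_t *first_table_entry(lpae_t *first[2], int index)
    {
        /* flat index 0..1023 across the two consecutive first-level pages */
        return &first[index / LPAE_ENTRIES][index % LPAE_ENTRIES];
    }

    for ( i = gp2m_start_index; i < gp2m_end_index; ++i )
    {
        lpae_t *e = first_table_entry(first, i);
        /* ... copy *e into the VLPT second level as the original loop does ... */
    }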
Ian Campbell
2013-Nov-19 11:54 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
On Fri, 2013-11-15 at 11:26 +0900, Jaeyong Yoo wrote:> > -----Original Message----- > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > > Sent: Wednesday, November 13, 2013 1:56 AM > > To: Jaeyong Yoo > > Cc: xen-devel@lists.xen.org > > Subject: Re: [Xen-devel] [PATCH RESEND v5 5/6] xen/arm: Implement > > hypercall for dirty page tracing > > > > On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote: > > > Add hypercall (shadow op: enable/disable and clean/peek dirtied page > > bitmap). > > > It consists of two parts: dirty page detecting and saving. > > > For detecting, we setup the guest p2m''s leaf PTE read-only and > > > whenever the guest tries to write something, permission fault happens > > and traps into xen. > > > The permission-faulted GPA should be saved for the toolstack (when it > > > wants to see which pages are dirtied). For this purpose, we temporarily > > save the GPAs into bitmap. > > > > > > Changes from v4: > > > 1. For temporary saving dirty pages, use bitmap rather than linked list. > > > 2. Setup the p2m''s second level page as read-write in the view of xen''s > > memory access. > > > It happens in p2m_create_table function. > > > > > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > > > --- > > > xen/arch/arm/domain.c | 14 +++ > > > xen/arch/arm/domctl.c | 9 ++ > > > xen/arch/arm/mm.c | 103 +++++++++++++++++++- > > > xen/arch/arm/p2m.c | 206 > > ++++++++++++++++++++++++++++++++++++++++ > > > xen/arch/arm/traps.c | 9 ++ > > > xen/include/asm-arm/domain.h | 7 ++ > > > xen/include/asm-arm/mm.h | 7 ++ > > > xen/include/asm-arm/p2m.h | 4 + > > > xen/include/asm-arm/processor.h | 2 + > > > 9 files changed, 360 insertions(+), 1 deletion(-) > > > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > > c0b5dd8..0a32301 100644 > > > --- a/xen/arch/arm/domain.c > > > +++ b/xen/arch/arm/domain.c > > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > > WRITE_SYSREG(hcr, HCR_EL2); > > > isb(); > > > > > > + /* for dirty-page tracing > > > + * XXX: how do we consider SMP case? > > > + */ > > > + if ( n->domain->arch.dirty.mode ) > > > + restore_vlpt(n->domain); > > > > This is an interesting question. xen_second is shared between all pcpus, > > which means that the vlpt is currently only usable from a single physical > > CPU at a time. > > > > Currently the only per-cpu area is the domheap region from 2G-4G. We could > > steal some address space from the top or bottom of there? > > oh right hmm. Then, how about place the vlpt area starting from 2G (between dom > heap and xen heap), and let the map_domain_page map the domain page starting from > the VLPT-end address? > > For this, I think just changing DOMHEAP_VIRT_START to some place (maybe 0x88000000) > would be suffice.You might find that stealing from the start needs a bit more thought in the PT setup code wrt offsets and alignments and assumption that it starts on a 1GB boundary etc. In which case stealing the space from the end might turn out to be simpler by reducing DOMHEAP_ENTRIES. It would be good to keep DOMHEAP_ENTRIES a power of two I suspect.> > > +static inline void mark_dirty_bitmap(struct domain *d, paddr_t addr) > > > +{ > > > + paddr_t ram_base = (paddr_t) GUEST_RAM_BASE; > > > + int bit_index = PFN_DOWN(addr - ram_base); > > > + int page_index = bit_index >> (PAGE_SHIFT + 3); > > > > +3? > > +3 is for changing the bit-index to byte-index. I think it would be better to > use BYTE_SHIFT kind of thing.Or a comment, yes. 
[...]> > > + vlp2m_pte = get_vlpt_3lvl_pte(addr); > > > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 && > > > + vlp2m_pte->p2m.avail == 0 /* reuse avail bit as read-only */ ) > > > + { > > > + lpae_t pte = *vlp2m_pte; > > > + pte.p2m.write = 1; > > > + write_pte(vlp2m_pte, pte); > > > > Do we not need the p2m lock around here somewhere? > > Oh, I think this brings an another question. In the dirty-page tracing phase, > would it be OK to permit changes in p2m page table (such as adding new pages)? > In that case, VLPT and dirty-bitmap should be re-prepared for taking account this change.The p2m can certainly change during migration -- e.g. grant table ops resulting from IO. Those changes should all call back into the p2m. Hopefully this is in the common grant table code etc but you should check, especially that the ARM implementation of the callback isn''t a NULL! Or maybe this is done in arch code -- best to check when and where x86 does this I think.> > > > > > + flush_tlb_local(); > > > > If SMP was working this would need to be inner shareable to cope with > > vcpus running on other pcpus. > > You meant the p2m page table should be inner shareable?I mean the TLB flush would need to be to all cpus in the inner shareable domain, not just the local processor.> > > > > + } > > > + else > > > + { > > > + spin_lock(&d->arch.dirty.lock); > > > + for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i ) > > > + { > > > + int j = 0; > > > + uint8_t *bitmap; > > > + copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE, > > > + d->arch.dirty.bitmap[i], > > > + bitmap_size < PAGE_SIZE ? bitmap_size : > > > + > > > + PAGE_SIZE); > > > > The bitmap is always a mutliple of page_size, isn''t it? > > But the actually used bits inside of the bitmap are not multiple of page_size I think.Oh right. I think it is up to us to declare in the interface whether the supplied buffer should be rounded up to a page boundary or not. There''s a pretty good chance that the userspace allocations will all be entire pages anyhow. Ian.
Ian Campbell
2013-Nov-19 11:57 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
On Tue, 2013-11-19 at 10:32 +0900, Jaeyong Yoo wrote:> > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > > > c0b5dd8..0a32301 100644 > > > > --- a/xen/arch/arm/domain.c > > > > +++ b/xen/arch/arm/domain.c > > > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > > > WRITE_SYSREG(hcr, HCR_EL2); > > > > isb(); > > > > > > > > + /* for dirty-page tracing > > > > + * XXX: how do we consider SMP case? > > > > + */ > > > > + if ( n->domain->arch.dirty.mode ) > > > > + restore_vlpt(n->domain); > > > > > > This is an interesting question. xen_second is shared between all > > > pcpus, which means that the vlpt is currently only usable from a > > > single physical CPU at a time. > > > > > > Currently the only per-cpu area is the domheap region from 2G-4G. We > > > could steal some address space from the top or bottom of there? > > > > oh right hmm. Then, how about place the vlpt area starting from 2G > > (between dom heap and xen heap), and let the map_domain_page map the > > domain page starting from the VLPT-end address? > > > > For this, I think just changing DOMHEAP_VIRT_START to some place (maybe > > 0x88000000) would be suffice. > > > > I just found out that DOMHEAP_VIRT_START should be aligned to first-level > size. > So, 0x88000000 wouldn''t work. > So, I''m thinking to use the back of the domheap area and piggybacking the > VLPT > into the dom-heap region. For this, we should use something like the > following layout: > > #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) > #define VIRT_LIN_P2M_START _AT(vaddr_t,0xf8000000) > #define VIRT_LIN_P2M_END _AT(vaddr_t<0xffffffff) > #define DOMHEAP_VIRT_END _AT(vaddr_t,0xffffffff) > > where VLPT area is overlapped to domheap. This is necessary since dommap > size is decided > at booting time by the calculation of DOMHEAP_VIRT_END - DOMHEAP_VIRT_START, > and at this time > we have to also allocate the dommap for VLPT.I''d prefer to make that explicit at setup time than artificially overlaying the two. I was half wondering about providing a function to allocate a chunk of domheap address range which could then be used by this code to get a bit of memory dynamically. I''m not sure about this. One concern is that it might provoke pessimal behaviour from the domheap (which is a hash table, so it would be stealing slots).> Something else we can do is to > give VLPT for 1G memory > and giving it its own per-cpu page table. But, using 1G for VLPT looks like > some waste of virtual address.Yes, I don''t think we want to do that, mainly since it would necessarily halve the size of the domheap range.> So, I''m thinking of using overlapped VLPT. Could you tell me your opinion > about this? > > Jaeyong > >
Ian Campbell
2013-Nov-19 12:01 UTC
Re: [PATCH RESEND v5 6/6] xen/arm: Implement toolstack for xl restore/save and migrate
On Tue, 2013-11-19 at 15:06 +0400, Eugene Fedotov wrote:> Our migration protocol consists of: header (Console and XenStore PFN), > memory data, HVM context. Can we guarantee that HVM context of version X > is compatible (or the same) as HVM context of version X ? I think if we^ +1 ?> add some HVM hardware (such as virtual RTC) in version X+1 we may have a > compatibility problem, because HVM buffer structure had been changed. I > think we cannot predict such things in future versions.Indeed, it may well be that Xen version X+1 needs some code adding to be able to consume Xen version X save records. I thin x86 does this with either explicit or implicit versioning of the individual save structures.> But our > migration protocol may have version number and migration between X and > X+1 Xen versions is possible when we use the same protocol number in > both sides (we may compare protocol versions during the migration runtime).On x86 the protocol almost always changes in some (possibly minor) way between versions, so this approach would mean that we didn''t support X->X+1 migrations in the majority of releases which would be unacceptable.> >>> [...] > >>>> + > >>>> +static int restore_guest_params(xc_interface *xch, int io_fd, > >>>> + uint32_t dom, guest_params_t *params) > >>>> +{ > >>>> [...] > >>>> + if ( xc_domain_setmaxmem(xch, dom, maxmemkb) ) > >>>> + { > >>>> + ERROR("Can''t set memory map"); > >>>> + return -1; > >>>> + } > >>>> + > >>>> + /* Set max. number of vcpus as max_vcpu_id + 1 */ > >>>> + if ( xc_domain_max_vcpus(xch, dom, params->max_vcpu_id + 1) ) > >>> Does the higher level toolstack not take care of vcpus and maxmem? I > >>> thought so. I think this is how it shoud be. > >> For my tests guest config information is not transferred for ARM case > >> from high-level stack. At the migration receiver side toolstack always > >> create a new domain with vcpus=1 and default max. mem. So we have to > >> send guest information as our local guest_params structure (at the > >> beginning of migration). > >> It is easy way to work "xl save" or "xl migrate" without modification of > >> libxl level, but you may have another idea? > >> Also, toolstack_restore callback is not set (NULL) for ARM case. > > So you aren''t using xl to do the migration? This is what we should > > ultimately be aiming for. It is almost certainly going to require fixes > > at the libxl level though. > > > > If you need to send additional information for your usecase then it > > should be done at the layer above libxc. > I found the reason why guest config information is not saved in libxl. > the following line in xl_cmdimpl.c: > if (!dom_info->quiet) > printf("Parsing config from %s\n", config_source); > in function "create_domain" > calls printf instead of fprintf(stderr, ...). > It is not the same on ARM. I have feeling that printf() causes output to > socket buffer and it breaks config data.If the socket buffer is attached to stdout then that is a bug which needs fixing. I''d be surprised if this turned out to be the case though -- nothing ought to be closing stdout!> Should we use fprintf(stderr, "Parsing config from %s\n", > config_source); here? > If we do this, config is transferred and we do not need set VCPU number > and memory manually. > > Best regards, > Evgeny. >
Eugene Fedotov
2013-Nov-19 12:35 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
15.11.2013 11:04, Eugene Fedotov wrote:> 13.11.2013 14:58, Ian Campbell wrote: >> On Wed, 2013-11-13 at 12:28 +0400, Eugene Fedotov wrote: >>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c >>>>> index 123280e..3801f07 100644 >>>>> --- a/xen/arch/arm/mm.c >>>>> +++ b/xen/arch/arm/mm.c >>>>> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, >>>>> unsigned long mem_type) >>>>> unsigned long domain_get_maximum_gpfn(struct domain *d) >>>> s/unsigned long/xen_pfn_t/ I think. >>>> >>>> Urk, which will break the ABI for this hypercall. That''s something >>>> of a >>>> conundrum :-/ >>>> >>>> It is a domctl so we are at liberty to change it, probably to taking a >>>> xen_pfn_t * to fill in instead of returning the value. >>>> >>>> I''m in two minds about whether we should do so now or postpone it >>>> (which >>>> risks forgetting and having an obscure bug somewhere down the line). >>> OK, this way, or other solution is to purge this hypercall yet in >>> migration and use max_memkb field in DOMCTL_getdomaininfo to obtain the >>> maximum PFN. >> Is there an interface to retrieve that? I don''t see it :-(No, it is another hypercall XEN_DOMCTL_getdomaininfo. max_memkb field in xc_dominfo_t structure is a second way to retrieve maximum memory size and calculate PFN. Do we need domain_get_maximum_gpfn ? In addition to this question I need to ask why we need xen_pfn_t (that is always 64 bit on Xen-arm) for physical memory PFN. Currently the following libxc proxies for DOMCTL: xc_domain_setmaxmem and max_memkb field in xc_domain_getinfo() are working with unsigned long max_mem_kb parameter (that is 32 bit for ARM32) In case if 64 bit PFN is needed, xen_pfn_t returning hypercall looks like better then xc_domain_getinfo() . Best regards, Evgeny.
Ian Campbell
2013-Nov-19 12:53 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
On Tue, 2013-11-19 at 16:35 +0400, Eugene Fedotov wrote:> 15.11.2013 11:04, Eugene Fedotov wrote: > > 13.11.2013 14:58, Ian Campbell wrote: > >> On Wed, 2013-11-13 at 12:28 +0400, Eugene Fedotov wrote: > >>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c > >>>>> index 123280e..3801f07 100644 > >>>>> --- a/xen/arch/arm/mm.c > >>>>> +++ b/xen/arch/arm/mm.c > >>>>> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, > >>>>> unsigned long mem_type) > >>>>> unsigned long domain_get_maximum_gpfn(struct domain *d) > >>>> s/unsigned long/xen_pfn_t/ I think. > >>>> > >>>> Urk, which will break the ABI for this hypercall. That''s something > >>>> of a > >>>> conundrum :-/ > >>>> > >>>> It is a domctl so we are at liberty to change it, probably to taking a > >>>> xen_pfn_t * to fill in instead of returning the value. > >>>> > >>>> I''m in two minds about whether we should do so now or postpone it > >>>> (which > >>>> risks forgetting and having an obscure bug somewhere down the line). > >>> OK, this way, or other solution is to purge this hypercall yet in > >>> migration and use max_memkb field in DOMCTL_getdomaininfo to obtain the > >>> maximum PFN. > >> Is there an interface to retrieve that? I don''t see it :-( > No, it is another hypercall XEN_DOMCTL_getdomaininfo. max_memkb field in > xc_dominfo_t structure is a second way to retrieve maximum memory size > and calculate PFN. Do we need domain_get_maximum_gpfn ?I think not yet.> In addition to this question I need to ask why we need xen_pfn_t (that > is always 64 bit on Xen-arm) for physical memory PFN. Currently the > following libxc proxies for DOMCTL: > xc_domain_setmaxmem and max_memkb field in xc_domain_getinfo() are > working with unsigned long max_mem_kb parameter (that is 32 bit for ARM32)The intention is that the hypercall interface be identical across 32- and 64-bit hypervisors to avoid the need for any argument translation when running 32-bit guests on a 64-bit hypervisor etc. Hence MFNs and PFNs are always 64-bits.> > In case if 64 bit PFN is needed, xen_pfn_t returning hypercall looks > like better then xc_domain_getinfo() . > > Best regards, > Evgeny. >
Eugene Fedotov
2013-Nov-19 13:09 UTC
Re: [PATCH RESEND v5 2/6] xen/arm: Implement get_maximum_gpfn hypercall for arm
19.11.2013 16:53, Ian Campbell wrote:> On Tue, 2013-11-19 at 16:35 +0400, Eugene Fedotov wrote: >> 15.11.2013 11:04, Eugene Fedotov wrote: >>> 13.11.2013 14:58, Ian Campbell wrote: >>>> On Wed, 2013-11-13 at 12:28 +0400, Eugene Fedotov wrote: >>>>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c >>>>>>> index 123280e..3801f07 100644 >>>>>>> --- a/xen/arch/arm/mm.c >>>>>>> +++ b/xen/arch/arm/mm.c >>>>>>> @@ -927,7 +927,11 @@ int page_is_ram_type(unsigned long mfn, >>>>>>> unsigned long mem_type) >>>>>>> unsigned long domain_get_maximum_gpfn(struct domain *d) >>>>>> s/unsigned long/xen_pfn_t/ I think. >>>>>> >>>>>> Urk, which will break the ABI for this hypercall. That''s something >>>>>> of a >>>>>> conundrum :-/ >>>>>> >>>>>> It is a domctl so we are at liberty to change it, probably to taking a >>>>>> xen_pfn_t * to fill in instead of returning the value. >>>>>> >>>>>> I''m in two minds about whether we should do so now or postpone it >>>>>> (which >>>>>> risks forgetting and having an obscure bug somewhere down the line). >>>>> OK, this way, or other solution is to purge this hypercall yet in >>>>> migration and use max_memkb field in DOMCTL_getdomaininfo to obtain the >>>>> maximum PFN. >>>> Is there an interface to retrieve that? I don''t see it :-( >> No, it is another hypercall XEN_DOMCTL_getdomaininfo. max_memkb field in >> xc_dominfo_t structure is a second way to retrieve maximum memory size >> and calculate PFN. Do we need domain_get_maximum_gpfn ? > I think not yet.OK, I''ll remove domain_get_maximum_gpfn from patches. Best regards, Evgeny.
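A sketch of the toolstack side once the hypercall is dropped, deriving the PFN range from xc_domain_getinfo() instead; this assumes guest RAM is a single contiguous block starting at GUEST_RAM_BASE, as elsewhere in the series:

    xc_dominfo_t info;
    xen_pfn_t start_gpfn, max_gpfn;

    if ( xc_domain_getinfo(xch, dom, 1, &info) != 1 || info.domid != dom )
        return -1;

    start_gpfn = GUEST_RAM_BASE >> PAGE_SHIFT;
    /* max_memkb is in KiB; convert to a page count */
    max_gpfn   = start_gpfn + (info.max_memkb >> (PAGE_SHIFT - 10));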
Jaeyong Yoo
2013-Nov-20 09:49 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
> -----Original Message----- > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > Sent: Tuesday, November 19, 2013 8:57 PM > To: Jaeyong Yoo > Cc: xen-devel@lists.xen.org > Subject: Re: [Xen-devel] [PATCH RESEND v5 5/6] xen/arm: Implement > hypercall for dirty page tracing > > On Tue, 2013-11-19 at 10:32 +0900, Jaeyong Yoo wrote: > > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > > > > c0b5dd8..0a32301 100644 > > > > > --- a/xen/arch/arm/domain.c > > > > > +++ b/xen/arch/arm/domain.c > > > > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > > > > WRITE_SYSREG(hcr, HCR_EL2); > > > > > isb(); > > > > > > > > > > + /* for dirty-page tracing > > > > > + * XXX: how do we consider SMP case? > > > > > + */ > > > > > + if ( n->domain->arch.dirty.mode ) > > > > > + restore_vlpt(n->domain); > > > > > > > > This is an interesting question. xen_second is shared between all > > > > pcpus, which means that the vlpt is currently only usable from a > > > > single physical CPU at a time. > > > > > > > > Currently the only per-cpu area is the domheap region from 2G-4G. > > > > We could steal some address space from the top or bottom of there? > > > > > > oh right hmm. Then, how about place the vlpt area starting from 2G > > > (between dom heap and xen heap), and let the map_domain_page map the > > > domain page starting from the VLPT-end address? > > > > > > For this, I think just changing DOMHEAP_VIRT_START to some place > > > (maybe > > > 0x88000000) would be suffice. > > > > > > > I just found out that DOMHEAP_VIRT_START should be aligned to > > first-level size. > > So, 0x88000000 wouldn''t work. > > So, I''m thinking to use the back of the domheap area and piggybacking > > the VLPT into the dom-heap region. For this, we should use something > > like the following layout: > > > > #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) > > #define VIRT_LIN_P2M_START _AT(vaddr_t,0xf8000000) > > #define VIRT_LIN_P2M_END _AT(vaddr_t<0xffffffff) > > #define DOMHEAP_VIRT_END _AT(vaddr_t,0xffffffff) > > > > where VLPT area is overlapped to domheap. This is necessary since > > dommap size is decided at booting time by the calculation of > > DOMHEAP_VIRT_END - DOMHEAP_VIRT_START, and at this time we have to > > also allocate the dommap for VLPT. > > I''d prefer to make that explicit at setup time than artificially > overlaying the two. > > I was half wondering about providing a function to allocate a chunk of > domheap address range which could then be used by this code to get a bit > of memory dynamically. I''m not sure about this. One concern is that it > might provoke pessimal behaviour from the domheap (which is a hash table, > so it would be stealing slots).I think so. If we are going to provide such a function like that, we have to think about how to handle a pool of memory region ready for allocation, and how does it work with ''map_domain_page''. One quick thought is to make "DOMHEAP_ENTRIES" as a variable and adjust this value as the large-chunk memory is allocated or freed. But, honestly I don''t want to increase the probability of breaking anything. I think it would be better off setting it statically at booting time. And, by doing so, DOMHEAP_ENTRIES may not be the power of 2. I think it is OK, but you suggested that DOMHEAP_ENTRIES being power of 2 is good. Could you explain a bit more about this? Jaeyong.
Ian Campbell
2013-Nov-20 10:03 UTC
Re: [PATCH RESEND v5 5/6] xen/arm: Implement hypercall for dirty page tracing
On Wed, 2013-11-20 at 18:49 +0900, Jaeyong Yoo wrote:> > -----Original Message----- > > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > > Sent: Tuesday, November 19, 2013 8:57 PM > > To: Jaeyong Yoo > > Cc: xen-devel@lists.xen.org > > Subject: Re: [Xen-devel] [PATCH RESEND v5 5/6] xen/arm: Implement > > hypercall for dirty page tracing > > > > On Tue, 2013-11-19 at 10:32 +0900, Jaeyong Yoo wrote: > > > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index > > > > > > c0b5dd8..0a32301 100644 > > > > > > --- a/xen/arch/arm/domain.c > > > > > > +++ b/xen/arch/arm/domain.c > > > > > > @@ -215,6 +215,12 @@ static void ctxt_switch_to(struct vcpu *n) > > > > > > WRITE_SYSREG(hcr, HCR_EL2); > > > > > > isb(); > > > > > > > > > > > > + /* for dirty-page tracing > > > > > > + * XXX: how do we consider SMP case? > > > > > > + */ > > > > > > + if ( n->domain->arch.dirty.mode ) > > > > > > + restore_vlpt(n->domain); > > > > > > > > > > This is an interesting question. xen_second is shared between all > > > > > pcpus, which means that the vlpt is currently only usable from a > > > > > single physical CPU at a time. > > > > > > > > > > Currently the only per-cpu area is the domheap region from 2G-4G. > > > > > We could steal some address space from the top or bottom of there? > > > > > > > > oh right hmm. Then, how about place the vlpt area starting from 2G > > > > (between dom heap and xen heap), and let the map_domain_page map the > > > > domain page starting from the VLPT-end address? > > > > > > > > For this, I think just changing DOMHEAP_VIRT_START to some place > > > > (maybe > > > > 0x88000000) would be suffice. > > > > > > > > > > I just found out that DOMHEAP_VIRT_START should be aligned to > > > first-level size. > > > So, 0x88000000 wouldn''t work. > > > So, I''m thinking to use the back of the domheap area and piggybacking > > > the VLPT into the dom-heap region. For this, we should use something > > > like the following layout: > > > > > > #define DOMHEAP_VIRT_START _AT(vaddr_t,0x80000000) > > > #define VIRT_LIN_P2M_START _AT(vaddr_t,0xf8000000) > > > #define VIRT_LIN_P2M_END _AT(vaddr_t<0xffffffff) > > > #define DOMHEAP_VIRT_END _AT(vaddr_t,0xffffffff) > > > > > > where VLPT area is overlapped to domheap. This is necessary since > > > dommap size is decided at booting time by the calculation of > > > DOMHEAP_VIRT_END - DOMHEAP_VIRT_START, and at this time we have to > > > also allocate the dommap for VLPT. > > > > I''d prefer to make that explicit at setup time than artificially > > overlaying the two. > > > > I was half wondering about providing a function to allocate a chunk of > > domheap address range which could then be used by this code to get a bit > > of memory dynamically. I''m not sure about this. One concern is that it > > might provoke pessimal behaviour from the domheap (which is a hash table, > > so it would be stealing slots). > > I think so. If we are going to provide such a function like that, we have to > think about how to handle a pool of memory region ready for allocation, and > how does it work with ''map_domain_page''. One quick thought is to make > "DOMHEAP_ENTRIES" as a variable and adjust this value as the large-chunk > memory is allocated or freed. But, honestly I don''t want to increase the > probability of breaking anything. > > I think it would be better off setting it statically at booting time. And, > by doing so, DOMHEAP_ENTRIES may not be the power of 2. 
I think it is OK, but > you suggested that DOMHEAP_ENTRIES being power of 2 is good. Could you explain > a bit more about this?
I was just thinking that it would make the modular arithmetic simpler and/or that the hashing algorithm might rely on it somehow. But thinking again I think this mustn't be the case, and anyway keeping with powers of two would require halving from 2GB to 1GB which is undesirable. Ian.
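Illustrative only: one way the layout could look if the VLPT range is carved from the top of the per-cpu domheap region at setup time, as discussed, rather than overlaid on it; the actual addresses and the DOMHEAP_ENTRIES definition would need to respect the alignment constraints mentioned above:

    #define DOMHEAP_VIRT_START  _AT(vaddr_t,0x80000000)
    #define DOMHEAP_VIRT_END    _AT(vaddr_t,0xf7ffffff)
    #define VIRT_LIN_P2M_START  _AT(vaddr_t,0xf8000000)
    #define VIRT_LIN_P2M_END    _AT(vaddr_t,0xffffffff)

    /* fewer 2MB domheap slots; no longer a power of two */
    #define DOMHEAP_ENTRIES \
        ((DOMHEAP_VIRT_END + 1 - DOMHEAP_VIRT_START) >> SECOND_SHIFT)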