Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 00/23] PVH xen: Phase I, Version 10 patches...
Hi Keir,

These V10 patches are in pretty good shape. I've addressed all the issues Jan had in previous versions, and jfyi, he and I have been back and forth on pretty much every patch in this series. Many of the patches have 'acked' or 'reviewed' tags. Kindly review.

Christoph: I've made the minor changes you suggested in V9, please review patches 20 and 21.

New in V10: minor changes in 20/21 to not call the vmx create and destroy functions, as they are no-ops for PVH. Also, patch 16 adds a check to not migrate hvm timers for PVH.

To repeat from before, these are xen changes to support boot of a 64bit PVH domU guest. Built on top of unstable git c/s: 704302ce9404c73cfb687d31adcf67094ab5bb53

The public git tree for this:

    git clone -n git://oss.oracle.com/git/mrathor/xen.git .
    git checkout pvh.v10

Coming in future after this is done, two patchsets: 1) tools changes and 2) dom0 changes.

Thanks for all the help,
Mukesh
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 01/23] PVH xen: Add readme docs/misc/pvh-readme.txt
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- docs/misc/pvh-readme.txt | 56 ++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 56 insertions(+), 0 deletions(-) create mode 100644 docs/misc/pvh-readme.txt diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt new file mode 100644 index 0000000..3b14aa7 --- /dev/null +++ b/docs/misc/pvh-readme.txt @@ -0,0 +1,56 @@ + +PVH : an x86 PV guest running in an HVM container. HAP is required for PVH. + +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/ + +At present the only PVH guest is an x86 64bit PV linux. Patches are at: + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git + +A PVH guest kernel must support following features, as defined for linux +in arch/x86/xen/xen-head.S: + + #define FEATURES_PVH "|writable_descriptor_tables" \ + "|auto_translated_physmap" \ + "|supervisor_mode_kernel" \ + "|hvm_callback_vector" + +In a nutshell, the guest uses auto translate, ie, p2m is managed by xen, +it uses event callback and not vlapic emulation, the page tables are +native, so mmu_update hcall is N/A for PVH guest. Moreover IDT is native, so +set_trap_table hcall is also N/A for a PVH guest. For a full list of hcalls +supported for PVH, see pvh_hypercall64_table in arch/x86/hvm/hvm.c in xen. +From the ABI prespective, it''s mostly a PV guest with auto translate, altho +it does use hvm_op for setting callback vector. + +The initial phase targets the booting of a 64bit UP/SMP linux guest in PVH +mode. This is done by adding: pvh=1 in the config file. xl, and not xm, is +supported. Phase I patches are broken into three parts: + - xen changes for booting of 64bit PVH guest + - tools changes for creating a PVH guest + - boot of 64bit dom0 in PVH mode. + +Following fixme''s exist in the code: + - Add support for more memory types in arch/x86/hvm/mtrr.c. + - arch/x86/time.c: support more tsc modes. 
+ - check_guest_io_breakpoint(): check/add support for IO breakpoint. + - implement arch_get_info_guest() for pvh. + - vmxit_msr_read(): during AMD port go thru hvm_msr_read_intercept() again. + - verify bp matching on emulated instructions will work same as HVM for + PVH guest. see instruction_done() and check_guest_io_breakpoint(). + +Following remain to be done for PVH: + - AMD port. + - 32bit PVH guest support in both linux and xen. Xen changes are tagged + "32bitfixme". + - Add support for monitoring guest behavior. See hvm_memory_event* functions + in hvm.c + - vcpu hotplug support + - Live migration of PVH guests. + - Avail PVH dom0 of posted interrupts. (This will be a big win). + + +Note, any emails to me must be cc''d to xen devel mailing list. OTOH, please +cc me on PVH emails to the xen devel mailing list. + +Mukesh Rathor +mukesh.rathor [at] oracle [dot] com -- 1.7.2.3
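The FEATURES_PVH string in the readme above can be read as a list of '|'-separated feature names. As a minimal sketch (not code from this series), the check below tests whether a features string names all four of them. The helpers has_feature() and names_all_pvh_features() are hypothetical, and the sketch deliberately ignores any prefix characters a real ELF-note features string may carry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Feature names copied from FEATURES_PVH in the readme. */
static const char *const pvh_required[] = {
    "writable_descriptor_tables",
    "auto_translated_physmap",
    "supervisor_mode_kernel",
    "hvm_callback_vector",
};

/* Hypothetical helper: true if `name` is a whole '|'-separated token. */
bool has_feature(const char *features, const char *name)
{
    size_t len;
    const char *p = features;

    assert(features != NULL && name != NULL);
    len = strlen(name);

    while ( *p != '\0' )
    {
        size_t tok = strcspn(p, "|");   /* length of the current token */

        if ( tok == len && memcmp(p, name, len) == 0 )
            return true;
        p += tok;
        if ( *p == '|' )
            p++;                        /* skip the separator */
    }
    return false;
}

/* Hypothetical helper: true only if all four PVH features are named. */
bool names_all_pvh_features(const char *features)
{
    size_t i;

    for ( i = 0; i < sizeof(pvh_required) / sizeof(pvh_required[0]); i++ )
        if ( !has_feature(features, pvh_required[i]) )
            return false;
    return true;
}
```

Matching whole tokens rather than substrings keeps a name such as "xauto_translated_physmap" from being mistaken for the real feature.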
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 02/23] PVH xen: turn gdb_frames/gdt_ents into union.
Changes in V2: - Add __XEN_INTERFACE_VERSION__ Changes in V3: - Rename union to ''gdt'' and rename field names. Change in V9: - Update __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400 for compat. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- tools/libxc/xc_domain_restore.c | 8 ++++---- tools/libxc/xc_domain_save.c | 6 +++--- xen/arch/x86/domain.c | 12 ++++++------ xen/arch/x86/domctl.c | 12 ++++++------ xen/include/public/arch-x86/xen.h | 14 ++++++++++++++ xen/include/public/xen-compat.h | 2 +- 6 files changed, 34 insertions(+), 20 deletions(-) diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c index 63d36cd..47aaca0 100644 --- a/tools/libxc/xc_domain_restore.c +++ b/tools/libxc/xc_domain_restore.c @@ -2055,15 +2055,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, munmap(start_info, PAGE_SIZE); } /* Uncanonicalise each GDT frame number. */ - if ( GET_FIELD(ctxt, gdt_ents) > 8192 ) + if ( GET_FIELD(ctxt, gdt.pv.num_ents) > 8192 ) { ERROR("GDT entry count out of range"); goto out; } - for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt_ents); j++ ) + for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt.pv.num_ents); j++ ) { - pfn = GET_FIELD(ctxt, gdt_frames[j]); + pfn = GET_FIELD(ctxt, gdt.pv.frames[j]); if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) ) { @@ -2071,7 +2071,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, j, (unsigned long)pfn); goto out; } - SET_FIELD(ctxt, gdt_frames[j], ctx->p2m[pfn]); + SET_FIELD(ctxt, gdt.pv.frames[j], ctx->p2m[pfn]); } /* Uncanonicalise the page table base pointer. 
*/ pfn = UNFOLD_CR3(GET_FIELD(ctxt, ctrlreg[3])); diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c index fbc15e9..e938628 100644 --- a/tools/libxc/xc_domain_save.c +++ b/tools/libxc/xc_domain_save.c @@ -1907,15 +1907,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter } /* Canonicalise each GDT frame number. */ - for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ ) + for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt.pv.num_ents); j++ ) { - mfn = GET_FIELD(&ctxt, gdt_frames[j]); + mfn = GET_FIELD(&ctxt, gdt.pv.frames[j]); if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn) ) { ERROR("GDT frame is not in range of pseudophys map"); goto out; } - SET_FIELD(&ctxt, gdt_frames[j], mfn_to_pfn(mfn)); + SET_FIELD(&ctxt, gdt.pv.frames[j], mfn_to_pfn(mfn)); } /* Canonicalise the page table base pointer. */ diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 874742c..73ddad7 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -784,8 +784,8 @@ int arch_set_info_guest( } for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i ) - fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt_frames[i]); - fail |= v->arch.pv_vcpu.gdt_ents != c(gdt_ents); + fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt.pv.frames[i]); + fail |= v->arch.pv_vcpu.gdt_ents != c(gdt.pv.num_ents); fail |= v->arch.pv_vcpu.ldt_base != c(ldt_base); fail |= v->arch.pv_vcpu.ldt_ents != c(ldt_ents); @@ -838,17 +838,17 @@ int arch_set_info_guest( return rc; if ( !compat ) - rc = (int)set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents); + rc = (int)set_gdt(v, c.nat->gdt.pv.frames, c.nat->gdt.pv.num_ents); else { unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames)]; - unsigned int n = (c.cmp->gdt_ents + 511) / 512; + unsigned int n = (c.cmp->gdt.pv.num_ents + 511) / 512; if ( n > ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames) ) return -EINVAL; for ( i = 0; i < n; ++i ) - gdt_frames[i] = c.cmp->gdt_frames[i]; - rc = (int)set_gdt(v, gdt_frames, 
c.cmp->gdt_ents); + gdt_frames[i] = c.cmp->gdt.pv.frames[i]; + rc = (int)set_gdt(v, gdt_frames, c.cmp->gdt.pv.num_ents); } if ( rc != 0 ) return rc; diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c index c2a04c4..f87d6ab 100644 --- a/xen/arch/x86/domctl.c +++ b/xen/arch/x86/domctl.c @@ -1300,12 +1300,12 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c) c(ldt_base = v->arch.pv_vcpu.ldt_base); c(ldt_ents = v->arch.pv_vcpu.ldt_ents); for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i ) - c(gdt_frames[i] = v->arch.pv_vcpu.gdt_frames[i]); - BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt_frames) != - ARRAY_SIZE(c.cmp->gdt_frames)); - for ( ; i < ARRAY_SIZE(c.nat->gdt_frames); ++i ) - c(gdt_frames[i] = 0); - c(gdt_ents = v->arch.pv_vcpu.gdt_ents); + c(gdt.pv.frames[i] = v->arch.pv_vcpu.gdt_frames[i]); + BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt.pv.frames) != + ARRAY_SIZE(c.cmp->gdt.pv.frames)); + for ( ; i < ARRAY_SIZE(c.nat->gdt.pv.frames); ++i ) + c(gdt.pv.frames[i] = 0); + c(gdt.pv.num_ents = v->arch.pv_vcpu.gdt_ents); c(kernel_ss = v->arch.pv_vcpu.kernel_ss); c(kernel_sp = v->arch.pv_vcpu.kernel_sp); for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.ctrlreg); ++i ) diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h index b7f6a51..25c8519 100644 --- a/xen/include/public/arch-x86/xen.h +++ b/xen/include/public/arch-x86/xen.h @@ -170,7 +170,21 @@ struct vcpu_guest_context { struct cpu_user_regs user_regs; /* User-level CPU registers */ struct trap_info trap_ctxt[256]; /* Virtual IDT */ unsigned long ldt_base, ldt_ents; /* LDT (linear address, # ents) */ +#if __XEN_INTERFACE_VERSION__ < 0x00040400 unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */ +#else + union { + struct { + /* GDT (machine frames, # ents) */ + unsigned long frames[16], num_ents; + } pv; + struct { + /* PVH: GDTR addr and size */ + uint64_t addr; + uint16_t limit; + } pvh; + } gdt; +#endif unsigned long kernel_ss, kernel_sp; /* Virtual TSS
(only SS1/SP1) */ /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */ unsigned long ctrlreg[8]; /* CR0-CR7 (control registers) */ diff --git a/xen/include/public/xen-compat.h b/xen/include/public/xen-compat.h index 69141c4..3eb80a0 100644 --- a/xen/include/public/xen-compat.h +++ b/xen/include/public/xen-compat.h @@ -27,7 +27,7 @@ #ifndef __XEN_PUBLIC_XEN_COMPAT_H__ #define __XEN_PUBLIC_XEN_COMPAT_H__ -#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040300 +#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040400 #if defined(__XEN__) || defined(__XEN_TOOLS__) /* Xen is built with matching headers and implements the latest interface. */ -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 03/23] PVH xen: add params to read_segment_register
In this preparatory patch, read_segment_register macro is changed to take vcpu and regs parameters. No functionality change. Changes in V2: None Changes in V3: - Replace read_sreg with read_segment_register Changes in V7: - Don''t make emulate_privileged_op() public here. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/domain.c | 8 ++++---- xen/arch/x86/traps.c | 26 ++++++++++++-------------- xen/arch/x86/x86_64/traps.c | 16 ++++++++-------- xen/include/asm-x86/system.h | 2 +- 4 files changed, 25 insertions(+), 27 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 73ddad7..5de5e49 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -1221,10 +1221,10 @@ static void save_segments(struct vcpu *v) struct cpu_user_regs *regs = &v->arch.user_regs; unsigned int dirty_segment_mask = 0; - regs->ds = read_segment_register(ds); - regs->es = read_segment_register(es); - regs->fs = read_segment_register(fs); - regs->gs = read_segment_register(gs); + regs->ds = read_segment_register(v, regs, ds); + regs->es = read_segment_register(v, regs, es); + regs->fs = read_segment_register(v, regs, fs); + regs->gs = read_segment_register(v, regs, gs); if ( regs->ds ) dirty_segment_mask |= DIRTY_DS; diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index 57dbd0c..378ef0a 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -1831,8 +1831,6 @@ static inline uint64_t guest_misc_enable(uint64_t val) } \ (eip) += sizeof(_x); _x; }) -#define read_sreg(regs, sr) read_segment_register(sr) - static int is_cpufreq_controller(struct domain *d) { return ((cpufreq_controller == FREQCTL_dom0_kernel) && @@ -1877,7 +1875,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) goto fail; /* emulating only opcodes not allowing SS to be default */ - data_sel = read_sreg(regs, ds); + data_sel = read_segment_register(v, regs, ds); /* Legacy prefixes. 
*/ for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) ) @@ -1895,17 +1893,17 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) data_sel = regs->cs; continue; case 0x3e: /* DS override */ - data_sel = read_sreg(regs, ds); + data_sel = read_segment_register(v, regs, ds); continue; case 0x26: /* ES override */ - data_sel = read_sreg(regs, es); + data_sel = read_segment_register(v, regs, es); continue; case 0x64: /* FS override */ - data_sel = read_sreg(regs, fs); + data_sel = read_segment_register(v, regs, fs); lm_ovr = lm_seg_fs; continue; case 0x65: /* GS override */ - data_sel = read_sreg(regs, gs); + data_sel = read_segment_register(v, regs, gs); lm_ovr = lm_seg_gs; continue; case 0x36: /* SS override */ @@ -1952,7 +1950,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) if ( !(opcode & 2) ) { - data_sel = read_sreg(regs, es); + data_sel = read_segment_register(v, regs, es); lm_ovr = lm_seg_none; } @@ -2685,22 +2683,22 @@ static void emulate_gate_op(struct cpu_user_regs *regs) ASSERT(opnd_sel); continue; case 0x3e: /* DS override */ - opnd_sel = read_sreg(regs, ds); + opnd_sel = read_segment_register(v, regs, ds); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x26: /* ES override */ - opnd_sel = read_sreg(regs, es); + opnd_sel = read_segment_register(v, regs, es); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x64: /* FS override */ - opnd_sel = read_sreg(regs, fs); + opnd_sel = read_segment_register(v, regs, fs); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x65: /* GS override */ - opnd_sel = read_sreg(regs, gs); + opnd_sel = read_segment_register(v, regs, gs); if ( !opnd_sel ) opnd_sel = dpl; continue; @@ -2753,7 +2751,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs) switch ( modrm & 7 ) { default: - opnd_sel = read_sreg(regs, ds); + opnd_sel = read_segment_register(v, regs, ds); break; case 4: case 5: opnd_sel = regs->ss; @@ -2781,7 +2779,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs) break; } if ( 
!opnd_sel ) - opnd_sel = read_sreg(regs, ds); + opnd_sel = read_segment_register(v, regs, ds); switch ( modrm & 7 ) { case 0: case 2: case 4: diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index bcd7609..9e0571d 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -122,10 +122,10 @@ void show_registers(struct cpu_user_regs *regs) fault_crs[0] = read_cr0(); fault_crs[3] = read_cr3(); fault_crs[4] = read_cr4(); - fault_regs.ds = read_segment_register(ds); - fault_regs.es = read_segment_register(es); - fault_regs.fs = read_segment_register(fs); - fault_regs.gs = read_segment_register(gs); + fault_regs.ds = read_segment_register(v, regs, ds); + fault_regs.es = read_segment_register(v, regs, es); + fault_regs.fs = read_segment_register(v, regs, fs); + fault_regs.gs = read_segment_register(v, regs, gs); } print_xen_info(); @@ -240,10 +240,10 @@ void do_double_fault(struct cpu_user_regs *regs) crs[2] = read_cr2(); crs[3] = read_cr3(); crs[4] = read_cr4(); - regs->ds = read_segment_register(ds); - regs->es = read_segment_register(es); - regs->fs = read_segment_register(fs); - regs->gs = read_segment_register(gs); + regs->ds = read_segment_register(current, regs, ds); + regs->es = read_segment_register(current, regs, es); + regs->fs = read_segment_register(current, regs, fs); + regs->gs = read_segment_register(current, regs, gs); printk("CPU: %d\n", cpu); _show_registers(regs, crs, CTXT_hypervisor, NULL); diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h index 6ab7d56..9bb22cb 100644 --- a/xen/include/asm-x86/system.h +++ b/xen/include/asm-x86/system.h @@ -4,7 +4,7 @@ #include <xen/lib.h> #include <xen/bitops.h> -#define read_segment_register(name) \ +#define read_segment_register(vcpu, regs, name) \ ({ u16 __sel; \ asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) ); \ __sel; \ -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 04/23] PVH xen: Move e820 fields out of pv_domain struct
This patch moves fields out of the pv_domain struct as they are used by PVH also. Changes in V6: - Don''t base on guest type the initialization and cleanup. Changes in V7: - If statement doesn''t need to be split across lines anymore. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/domain.c | 10 ++++------ xen/arch/x86/mm.c | 26 ++++++++++++-------------- xen/include/asm-x86/domain.h | 10 +++++----- 3 files changed, 21 insertions(+), 25 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 5de5e49..c361abf 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) if ( (rc = iommu_domain_init(d)) != 0 ) goto fail; } + spin_lock_init(&d->arch.e820_lock); if ( is_hvm_domain(d) ) { @@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) } } else - { /* 64-bit PV guest by default. 
*/ d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; - spin_lock_init(&d->arch.pv_domain.e820_lock); - } - /* initialize default tsc behavior in case tools don''t */ tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); spin_lock_init(&d->arch.vtsc_lock); @@ -592,8 +589,9 @@ void arch_domain_destroy(struct domain *d) { if ( is_hvm_domain(d) ) hvm_domain_destroy(d); - else - xfree(d->arch.pv_domain.e820); + + if ( d->arch.e820 ) + xfree(d->arch.e820); free_domain_pirqs(d); if ( !is_idle_domain(d) ) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index c00841c..412971e 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -4763,11 +4763,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) return -EFAULT; } - spin_lock(&d->arch.pv_domain.e820_lock); - xfree(d->arch.pv_domain.e820); - d->arch.pv_domain.e820 = e820; - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_lock(&d->arch.e820_lock); + xfree(d->arch.e820); + d->arch.e820 = e820; + d->arch.nr_e820 = fmap.map.nr_entries; + spin_unlock(&d->arch.e820_lock); rcu_unlock_domain(d); return rc; @@ -4781,26 +4781,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) if ( copy_from_guest(&map, arg, 1) ) return -EFAULT; - spin_lock(&d->arch.pv_domain.e820_lock); + spin_lock(&d->arch.e820_lock); /* Backwards compatibility. 
*/ - if ( (d->arch.pv_domain.nr_e820 == 0) || - (d->arch.pv_domain.e820 == NULL) ) + if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) ) { - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return -ENOSYS; } - map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820); - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, - map.nr_entries) || + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); + if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) || __copy_to_guest(arg, &map, 1) ) { - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return -EFAULT; } - spin_unlock(&d->arch.pv_domain.e820_lock); + spin_unlock(&d->arch.e820_lock); return 0; } diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h index d79464d..c3f9f8e 100644 --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -234,11 +234,6 @@ struct pv_domain /* map_domain_page() mapping cache. */ struct mapcache_domain mapcache; - - /* Pseudophysical e820 map (XENMEM_memory_map). */ - spinlock_t e820_lock; - struct e820entry *e820; - unsigned int nr_e820; }; struct arch_domain @@ -313,6 +308,11 @@ struct arch_domain (possibly other cases in the future */ uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ uint64_t vtsc_usercount; /* not used for hvm */ + + /* Pseudophysical e820 map (XENMEM_memory_map). */ + spinlock_t e820_lock; + struct e820entry *e820; + unsigned int nr_e820; } __cacheline_aligned; #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) -- 1.7.2.3
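The read path in the hunk above clamps the caller's entry count to what the domain has stored, and reports -ENOSYS when no map was ever set. A minimal user-space sketch of that logic is below. struct demo_domain, struct demo_e820entry and demo_get_memory_map() are made-up names, memcpy() stands in for copy_to_guest(), and the e820_lock taken by the real code is left out to keep the sketch self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Stand-in for the fields moved from pv_domain into arch_domain. */
struct demo_domain {
    struct demo_e820entry *e820;
    unsigned int nr_e820;
};

/*
 * Copy at most *nr_entries entries into buf and update *nr_entries to the
 * number actually copied. Returns 0 on success, -ENOSYS if no map is set.
 */
int demo_get_memory_map(const struct demo_domain *d,
                        struct demo_e820entry *buf, unsigned int *nr_entries)
{
    assert(d != NULL && buf != NULL && nr_entries != NULL);

    /* Backwards compatibility: no map was ever set for this domain. */
    if ( d->nr_e820 == 0 || d->e820 == NULL )
        return -ENOSYS;

    /* Equivalent of min(map.nr_entries, d->arch.nr_e820). */
    if ( *nr_entries > d->nr_e820 )
        *nr_entries = d->nr_e820;

    memcpy(buf, d->e820, *nr_entries * sizeof(*buf));
    return 0;
}
```

Because the fields now live in arch_domain rather than pv_domain, the same lookup works regardless of guest type, which is the point of the move.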
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 05/23] PVH xen: hvm related preparatory changes for PVH
This patch contains small changes to hvm.c because hvm_domain.params is not set/used/supported for PVH in the present series. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/hvm/hvm.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 1fcaed0..8284b3b 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -1070,10 +1070,13 @@ int hvm_vcpu_initialise(struct vcpu *v) { int rc; struct domain *d = v->domain; - domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN]; + domid_t dm_domid; hvm_asid_flush_vcpu(v); + spin_lock_init(&v->arch.hvm_vcpu.tm_lock); + INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list); + if ( (rc = vlapic_init(v)) != 0 ) goto fail1; @@ -1084,6 +1087,8 @@ int hvm_vcpu_initialise(struct vcpu *v) && (rc = nestedhvm_vcpu_initialise(v)) < 0 ) goto fail3; + dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN]; + /* Create ioreq event channel. */ rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL); if ( rc < 0 ) @@ -1106,9 +1111,6 @@ int hvm_vcpu_initialise(struct vcpu *v) get_ioreq(v)->vp_eport = v->arch.hvm_vcpu.xen_port; spin_unlock(&d->arch.hvm_domain.ioreq.lock); - spin_lock_init(&v->arch.hvm_vcpu.tm_lock); - INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list); - v->arch.hvm_vcpu.inject_trap.vector = -1; rc = setup_compat_arg_xlat(v); -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 06/23] PVH xen: vmx related preparatory changes for PVH
This is another preparatory patch for PVH. In this patch, following functions are made available for general/public use: vmx_fpu_enter(), get_instruction_length(), update_guest_eip(), and vmx_dr_access(). There is no functionality change. Changes in V2: - prepend vmx_ to get_instruction_length and update_guest_eip. - Do not export/use vmr(). Changes in V3: - Do not change emulate_forced_invalid_op() in this patch. Changes in V7: - Drop pv_cpuid going public here. Changes in V8: - Move vmx_fpu_enter prototype from vmcs.h to vmx.h Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/hvm/vmx/vmx.c | 72 +++++++++++++++--------------------- xen/arch/x86/hvm/vmx/vvmx.c | 2 +- xen/include/asm-x86/hvm/vmx/vmx.h | 17 ++++++++- 3 files changed, 47 insertions(+), 44 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 8ed7026..7292357 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -577,7 +577,7 @@ static int vmx_load_vmcs_ctxt(struct vcpu *v, struct hvm_hw_cpu *ctxt) return 0; } -static void vmx_fpu_enter(struct vcpu *v) +void vmx_fpu_enter(struct vcpu *v) { vcpu_restore_fpu_lazy(v); v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device); @@ -1608,24 +1608,12 @@ const struct hvm_function_table * __init start_vmx(void) return &vmx_function_table; } -/* - * Not all cases receive valid value in the VM-exit instruction length field. - * Callers must know what they''re doing! 
- */ -static int get_instruction_length(void) -{ - int len; - len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */ - BUG_ON((len < 1) || (len > 15)); - return len; -} - -void update_guest_eip(void) +void vmx_update_guest_eip(void) { struct cpu_user_regs *regs = guest_cpu_user_regs(); unsigned long x; - regs->eip += get_instruction_length(); /* Safe: callers audited */ + regs->eip += vmx_get_instruction_length(); /* Safe: callers audited */ regs->eflags &= ~X86_EFLAGS_RF; x = __vmread(GUEST_INTERRUPTIBILITY_INFO); @@ -1698,8 +1686,8 @@ static void vmx_do_cpuid(struct cpu_user_regs *regs) regs->edx = edx; } -static void vmx_dr_access(unsigned long exit_qualification, - struct cpu_user_regs *regs) +void vmx_dr_access(unsigned long exit_qualification, + struct cpu_user_regs *regs) { struct vcpu *v = current; @@ -2312,7 +2300,7 @@ static int vmx_handle_eoi_write(void) if ( (((exit_qualification >> 12) & 0xf) == 1) && ((exit_qualification & 0xfff) == APIC_EOI) ) { - update_guest_eip(); /* Safe: APIC data write */ + vmx_update_guest_eip(); /* Safe: APIC data write */ vlapic_EOI_set(vcpu_vlapic(current)); HVMTRACE_0D(VLAPIC); return 1; @@ -2525,7 +2513,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) HVMTRACE_1D(TRAP, vector); if ( v->domain->debugger_attached ) { - update_guest_eip(); /* Safe: INT3 */ + vmx_update_guest_eip(); /* Safe: INT3 */ current->arch.gdbsx_vcpu_event = TRAP_int3; domain_pause_for_debugger(); break; @@ -2633,7 +2621,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) */ inst_len = ((source != 3) || /* CALL, IRET, or JMP? */ (idtv_info & (1u<<10))) /* IntrType > 3? */ - ? get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0; + ? 
vmx_get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0; if ( (source == 3) && (idtv_info & INTR_INFO_DELIVER_CODE_MASK) ) ecode = __vmread(IDT_VECTORING_ERROR_CODE); regs->eip += inst_len; @@ -2641,15 +2629,15 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) break; } case EXIT_REASON_CPUID: - update_guest_eip(); /* Safe: CPUID */ + vmx_update_guest_eip(); /* Safe: CPUID */ vmx_do_cpuid(regs); break; case EXIT_REASON_HLT: - update_guest_eip(); /* Safe: HLT */ + vmx_update_guest_eip(); /* Safe: HLT */ hvm_hlt(regs->eflags); break; case EXIT_REASON_INVLPG: - update_guest_eip(); /* Safe: INVLPG */ + vmx_update_guest_eip(); /* Safe: INVLPG */ exit_qualification = __vmread(EXIT_QUALIFICATION); vmx_invlpg_intercept(exit_qualification); break; @@ -2657,7 +2645,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) regs->ecx = hvm_msr_tsc_aux(v); /* fall through */ case EXIT_REASON_RDTSC: - update_guest_eip(); /* Safe: RDTSC, RDTSCP */ + vmx_update_guest_eip(); /* Safe: RDTSC, RDTSCP */ hvm_rdtsc_intercept(regs); break; case EXIT_REASON_VMCALL: @@ -2667,7 +2655,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) rc = hvm_do_hypercall(regs); if ( rc != HVM_HCALL_preempted ) { - update_guest_eip(); /* Safe: VMCALL */ + vmx_update_guest_eip(); /* Safe: VMCALL */ if ( rc == HVM_HCALL_invalidate ) send_invalidate_req(); } @@ -2677,7 +2665,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) { exit_qualification = __vmread(EXIT_QUALIFICATION); if ( vmx_cr_access(exit_qualification) == X86EMUL_OKAY ) - update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */ + vmx_update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */ break; } case EXIT_REASON_DR_ACCESS: @@ -2691,7 +2679,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) { regs->eax = (uint32_t)msr_content; regs->edx = (uint32_t)(msr_content >> 32); - update_guest_eip(); /* Safe: RDMSR */ + vmx_update_guest_eip(); /* Safe: RDMSR */ } break; } @@ -2700,63 +2688,63 @@ void vmx_vmexit_handler(struct cpu_user_regs 
*regs) uint64_t msr_content; msr_content = ((uint64_t)regs->edx << 32) | (uint32_t)regs->eax; if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY ) - update_guest_eip(); /* Safe: WRMSR */ + vmx_update_guest_eip(); /* Safe: WRMSR */ break; } case EXIT_REASON_VMXOFF: if ( nvmx_handle_vmxoff(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMXON: if ( nvmx_handle_vmxon(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMCLEAR: if ( nvmx_handle_vmclear(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMPTRLD: if ( nvmx_handle_vmptrld(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMPTRST: if ( nvmx_handle_vmptrst(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMREAD: if ( nvmx_handle_vmread(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMWRITE: if ( nvmx_handle_vmwrite(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMLAUNCH: if ( nvmx_handle_vmlaunch(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_VMRESUME: if ( nvmx_handle_vmresume(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_INVEPT: if ( nvmx_handle_invept(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_INVVPID: if ( nvmx_handle_invvpid(regs) == X86EMUL_OKAY ) - update_guest_eip(); + vmx_update_guest_eip(); break; case EXIT_REASON_MWAIT_INSTRUCTION: @@ -2804,14 +2792,14 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) int bytes = (exit_qualification & 0x07) + 1; int dir = (exit_qualification & 0x08) ? 
IOREQ_READ : IOREQ_WRITE; if ( handle_pio(port, bytes, dir) ) - update_guest_eip(); /* Safe: IN, OUT */ + vmx_update_guest_eip(); /* Safe: IN, OUT */ } break; case EXIT_REASON_INVD: case EXIT_REASON_WBINVD: { - update_guest_eip(); /* Safe: INVD, WBINVD */ + vmx_update_guest_eip(); /* Safe: INVD, WBINVD */ vmx_wbinvd_intercept(); break; } @@ -2843,7 +2831,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) case EXIT_REASON_XSETBV: if ( hvm_handle_xsetbv(regs->ecx, (regs->rdx << 32) | regs->_eax) == 0 ) - update_guest_eip(); /* Safe: XSETBV */ + vmx_update_guest_eip(); /* Safe: XSETBV */ break; case EXIT_REASON_APIC_WRITE: diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c index 5dfbc54..82be4cc 100644 --- a/xen/arch/x86/hvm/vmx/vvmx.c +++ b/xen/arch/x86/hvm/vmx/vvmx.c @@ -2139,7 +2139,7 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs, tsc += __get_vvmcs(nvcpu->nv_vvmcx, TSC_OFFSET); regs->eax = (uint32_t)tsc; regs->edx = (uint32_t)(tsc >> 32); - update_guest_eip(); + vmx_update_guest_eip(); return 1; } diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h index c33b9f9..c21a303 100644 --- a/xen/include/asm-x86/hvm/vmx/vmx.h +++ b/xen/include/asm-x86/hvm/vmx/vmx.h @@ -446,6 +446,18 @@ static inline int __vmxon(u64 addr) return rc; } +/* + * Not all cases receive valid value in the VM-exit instruction length field. + * Callers must know what they''re doing! 
+ */ +static inline int vmx_get_instruction_length(void) +{ + int len; + len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */ + BUG_ON((len < 1) || (len > 15)); + return len; +} + void vmx_get_segment_register(struct vcpu *, enum x86_segment, struct segment_register *); void vmx_inject_extint(int trap); @@ -457,7 +469,10 @@ void ept_p2m_uninit(struct p2m_domain *p2m); void ept_walk_table(struct domain *d, unsigned long gfn); void setup_ept_dump(void); -void update_guest_eip(void); +void vmx_update_guest_eip(void); +void vmx_dr_access(unsigned long exit_qualification, + struct cpu_user_regs *regs); +void vmx_fpu_enter(struct vcpu *v); int alloc_p2m_hap_data(struct p2m_domain *p2m); void free_p2m_hap_data(struct p2m_domain *p2m); -- 1.7.2.3
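As a rough illustration of what the renamed vmx_update_guest_eip() does with the value from vmx_get_instruction_length(), the sketch below advances a mock instruction pointer and clears EFLAGS.RF. The demo_* names are hypothetical, the length is passed in instead of being read from the VMCS, a failed range check returns -1 where the real helper uses BUG_ON(), and the interruptibility-state handling that follows in the real function is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EFLAGS_RF 0x00010000u   /* Resume Flag, bit 16 of EFLAGS */

/* Mock of the two register fields the real function touches. */
struct demo_regs {
    uint64_t eip;
    uint64_t eflags;
};

/* Same range check as the BUG_ON() in vmx_get_instruction_length(). */
int demo_insn_len_valid(int len)
{
    return len >= 1 && len <= 15;
}

/* Skip the exiting instruction: advance eip and clear the resume flag. */
int demo_update_guest_eip(struct demo_regs *regs, int len)
{
    assert(regs != NULL);

    if ( !demo_insn_len_valid(len) )
        return -1;

    regs->eip += (uint64_t)len;
    regs->eflags &= ~(uint64_t)DEMO_EFLAGS_RF;
    return 0;
}
```

The "Safe:" comments at each call site in the patch record why the instruction length can be trusted for that particular exit reason, which is the caveat the header comment warns about.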
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 07/23] PVH xen: vmcs related preparatory changes for PVH
In this patch, some common code is factored out of construct_vmcs() to create vmx_set_common_host_vmcs_fields() to be used by PVH. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/hvm/vmx/vmcs.c | 58 +++++++++++++++++++++++------------------- 1 files changed, 32 insertions(+), 26 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index de9f592..36f167f 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -825,11 +825,40 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val) virtual_vmcs_exit(vvmcs); } -static int construct_vmcs(struct vcpu *v) +static void vmx_set_common_host_vmcs_fields(struct vcpu *v) { - struct domain *d = v->domain; uint16_t sysenter_cs; unsigned long sysenter_eip; + + /* Host data selectors. */ + __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS); + __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS); + __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS); + __vmwrite(HOST_FS_SELECTOR, 0); + __vmwrite(HOST_GS_SELECTOR, 0); + __vmwrite(HOST_FS_BASE, 0); + __vmwrite(HOST_GS_BASE, 0); + + /* Host control registers. */ + v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS; + __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); + __vmwrite(HOST_CR4, + mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0)); + + /* Host CS:RIP. */ + __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS); + __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler); + + /* Host SYSENTER CS:RIP. 
*/ + rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs); + __vmwrite(HOST_SYSENTER_CS, sysenter_cs); + rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip); + __vmwrite(HOST_SYSENTER_EIP, sysenter_eip); +} + +static int construct_vmcs(struct vcpu *v) +{ + struct domain *d = v->domain; u32 vmexit_ctl = vmx_vmexit_control; u32 vmentry_ctl = vmx_vmentry_control; @@ -932,30 +961,7 @@ static int construct_vmcs(struct vcpu *v) __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector); } - /* Host data selectors. */ - __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS); - __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS); - __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS); - __vmwrite(HOST_FS_SELECTOR, 0); - __vmwrite(HOST_GS_SELECTOR, 0); - __vmwrite(HOST_FS_BASE, 0); - __vmwrite(HOST_GS_BASE, 0); - - /* Host control registers. */ - v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS; - __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); - __vmwrite(HOST_CR4, - mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0)); - - /* Host CS:RIP. */ - __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS); - __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler); - - /* Host SYSENTER CS:RIP. */ - rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs); - __vmwrite(HOST_SYSENTER_CS, sysenter_cs); - rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip); - __vmwrite(HOST_SYSENTER_EIP, sysenter_eip); + vmx_set_common_host_vmcs_fields(v); /* MSR intercepts. */ __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0); -- 1.7.2.3
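For readers skimming the series, the refactoring in patch 07 can be pictured with the small sketch below. It is not Xen code: struct host_state, the selector constants and both construct_* functions are hypothetical stand-ins, and the only point is that guest-type-independent host fields are written by one helper that both the HVM path and a later PVH path can call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the host-state part of a VMCS. */
struct host_state {
    uint16_t ss, ds, es, fs, gs, cs;
    uint64_t cr0, rip;
};

/* Placeholder selector values, chosen only for illustration. */
enum { SKETCH_HYP_CS = 0x08, SKETCH_HYP_DS = 0x10 };

/* CR0.TS is bit 3 (architectural). */
#define SKETCH_CR0_TS (1u << 3)

/* Host fields that do not depend on the guest type live in one helper. */
static void set_common_host_fields(struct host_state *h,
                                   uint64_t cr0, uint64_t exit_handler)
{
    h->ss = h->ds = h->es = SKETCH_HYP_DS;
    h->fs = h->gs = 0;
    h->cs = SKETCH_HYP_CS;
    h->cr0 = cr0 | SKETCH_CR0_TS;
    h->rip = exit_handler;
}

/* Both constructors share the helper and differ only in guest setup. */
static void construct_hvm(struct host_state *h)
{
    memset(h, 0, sizeof(*h));
    /* ... HVM-specific guest fields would be set here ... */
    set_common_host_fields(h, 0x80050033u, 0x1000);
}

static void construct_pvh(struct host_state *h)
{
    memset(h, 0, sizeof(*h));
    /* ... PVH-specific guest fields would be set here ... */
    set_common_host_fields(h, 0x80050033u, 0x1000);
}
```

With this shape, a change to the host state only has to be made in one place.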
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
This patch introduces the concept of a PVH guest. There are other basic changes, such as creating macros to check for a pv/pvh vcpu/domain, and modifying the copy macros to account for PVH. Finally, guest_kernel_mode is changed to reflect that a PVH guest doesn't need to check the TF_kernel_mode flag, since the kernel runs in ring 0. Changes in V2: - make is_pvh/is_hvm enum instead of adding is_pvh as a new flag. - fix indentation and spacing in guest_kernel_mode macro. - add debug only BUG() in GUEST_KERNEL_RPL macro as it should no longer be called in any PVH paths. Changes in V3: - Rename enum fields, and add is_pv to it. - Get rid of is_hvm_or_pvh_* macros. Changes in V4: - Move e820 fields out of pv_domain struct. Changes in V5: - Move e820 changes above in V4, to a separate patch. Changes in V5: - Rename enum guest_type from is_pv, ... to guest_type_pv, .... Changes in V8: - Go to VMCS for DPL check instead of checking the rpl in guest_kernel_mode. Note, we drop the const qualifier from vcpu_show_registers() to accommodate the hvm function call in guest_kernel_mode(). - Also, hvm_kernel_mode is put in hvm.c because it's called from guest_kernel_mode in regs.h, which is a pretty early header include. Hence, we can't place it in hvm.h like other similar functions. The other alternative is to put hvm_kernel_mode in regs.h itself, but then it calls hvm_get_segment_register(), for which we either need to include hvm.h in regs.h, which is not possible, or add a prototype for hvm_get_segment_register(). But then the args to hvm_get_segment_register() also need their headers. So, in the end this seems to be the best/only way.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/debug.c | 2 +- xen/arch/x86/hvm/hvm.c | 8 ++++++++ xen/arch/x86/x86_64/traps.c | 2 +- xen/common/domain.c | 2 +- xen/include/asm-x86/desc.h | 4 +++- xen/include/asm-x86/domain.h | 2 +- xen/include/asm-x86/guest_access.h | 12 ++++++------ xen/include/asm-x86/x86_64/regs.h | 11 +++++++---- xen/include/public/domctl.h | 3 +++ xen/include/xen/sched.h | 21 ++++++++++++++++++--- 10 files changed, 49 insertions(+), 18 deletions(-) diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c index e67473e..167421d 100644 --- a/xen/arch/x86/debug.c +++ b/xen/arch/x86/debug.c @@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp, pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len); - mfn = (dp->is_hvm + mfn = (!is_pv_domain(dp) ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn) : dbg_pv_va2mfn(addr, dp, pgd3)); if ( mfn == INVALID_MFN ) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 8284b3b..bac4708 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v) return hvm_funcs.nhvm_intr_blocked(v); } +bool_t hvm_kernel_mode(struct vcpu *v) +{ + struct segment_register seg; + + hvm_get_segment_register(v, x86_seg_ss, &seg); + return (seg.attr.fields.dpl == 0); +} + /* * Local variables: * mode: C diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index 9e0571d..feb50ff 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs *regs) } } -void vcpu_show_registers(const struct vcpu *v) +void vcpu_show_registers(struct vcpu *v) { const struct cpu_user_regs *regs = &v->arch.user_regs; unsigned long crs[8]; diff --git a/xen/common/domain.c b/xen/common/domain.c index 6c264a5..38b1bad 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -236,7 +236,7 @@ struct domain 
*domain_create( goto fail; if ( domcr_flags & DOMCRF_hvm ) - d->is_hvm = 1; + d->guest_type = guest_type_hvm; if ( domid == 0 ) { diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h index 354b889..041e9d3 100644 --- a/xen/include/asm-x86/desc.h +++ b/xen/include/asm-x86/desc.h @@ -38,7 +38,9 @@ #ifndef __ASSEMBLY__ -#define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3) +/* PVH 32bitfixme : see emulate_gate_op call from do_general_protection */ +#define GUEST_KERNEL_RPL(d) ({ ASSERT(is_pv_domain(d)); \ + is_pv_32bit_domain(d) ? 1 : 3; }) /* Fix up the RPL of a guest segment selector. */ #define __fixup_guest_selector(d, sel) \ diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h index c3f9f8e..22a72df 100644 --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -447,7 +447,7 @@ struct arch_vcpu #define hvm_svm hvm_vcpu.u.svm void vcpu_show_execution_state(struct vcpu *); -void vcpu_show_registers(const struct vcpu *); +void vcpu_show_registers(struct vcpu *); /* Clean up CR4 bits that are not under guest control. */ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4); diff --git a/xen/include/asm-x86/guest_access.h b/xen/include/asm-x86/guest_access.h index ca700c9..675dda1 100644 --- a/xen/include/asm-x86/guest_access.h +++ b/xen/include/asm-x86/guest_access.h @@ -14,27 +14,27 @@ /* Raw access functions: no type checking. */ #define raw_copy_to_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? \ copy_to_user_hvm((dst), (src), (len)) : \ copy_to_user((dst), (src), (len))) #define raw_copy_from_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? \ copy_from_user_hvm((dst), (src), (len)) : \ copy_from_user((dst), (src), (len))) #define raw_clear_guest(dst, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? 
\ clear_user_hvm((dst), (len)) : \ clear_user((dst), (len))) #define __raw_copy_to_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? \ copy_to_user_hvm((dst), (src), (len)) : \ __copy_to_user((dst), (src), (len))) #define __raw_copy_from_guest(dst, src, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? \ copy_from_user_hvm((dst), (src), (len)) : \ __copy_from_user((dst), (src), (len))) #define __raw_clear_guest(dst, len) \ - (is_hvm_vcpu(current) ? \ + (!is_pv_vcpu(current) ? \ clear_user_hvm((dst), (len)) : \ clear_user((dst), (len))) diff --git a/xen/include/asm-x86/x86_64/regs.h b/xen/include/asm-x86/x86_64/regs.h index 3cdc702..d91a84b 100644 --- a/xen/include/asm-x86/x86_64/regs.h +++ b/xen/include/asm-x86/x86_64/regs.h @@ -10,10 +10,13 @@ #define ring_2(r) (((r)->cs & 3) == 2) #define ring_3(r) (((r)->cs & 3) == 3) -#define guest_kernel_mode(v, r) \ - (!is_pv_32bit_vcpu(v) ? \ - (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) : \ - (ring_1(r))) +bool_t hvm_kernel_mode(struct vcpu *); + +#define guest_kernel_mode(v, r) \ + (is_pvh_vcpu(v) ? hvm_kernel_mode(v) : \ + (!is_pv_32bit_vcpu(v) ? \ + (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) : \ + (ring_1(r)))) #define permit_softint(dpl, v, r) \ ((dpl) >= (guest_kernel_mode(v, r) ? 1 : 3)) diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 4c5b2bb..6b1aa11 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -89,6 +89,9 @@ struct xen_domctl_getdomaininfo { /* Being debugged. */ #define _XEN_DOMINF_debugged 6 #define XEN_DOMINF_debugged (1U<<_XEN_DOMINF_debugged) +/* domain is PVH */ +#define _XEN_DOMINF_pvh_guest 7 +#define XEN_DOMINF_pvh_guest (1U<<_XEN_DOMINF_pvh_guest) /* XEN_DOMINF_shutdown guest-supplied code. 
*/ #define XEN_DOMINF_shutdownmask 255 #define XEN_DOMINF_shutdownshift 16 diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index ae6a3b8..2d48d22 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -238,6 +238,14 @@ struct mem_event_per_domain struct mem_event_domain access; }; +/* + * PVH is a PV guest running in an HVM container. While is_hvm_* checks are + * false for it, it uses many of the HVM data structs. + */ +enum guest_type { + guest_type_pv, guest_type_pvh, guest_type_hvm +}; + struct domain { domid_t domain_id; @@ -285,8 +293,8 @@ struct domain struct rangeset *iomem_caps; struct rangeset *irq_caps; - /* Is this an HVM guest? */ - bool_t is_hvm; + enum guest_type guest_type; + #ifdef HAS_PASSTHROUGH /* Does this guest need iommu mappings? */ bool_t need_iommu; @@ -464,6 +472,9 @@ struct domain *domain_create( /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */ #define _DOMCRF_oos_off 4 #define DOMCRF_oos_off (1U<<_DOMCRF_oos_off) + /* DOMCRF_pvh: Create PV domain in HVM container. */ +#define _DOMCRF_pvh 5 +#define DOMCRF_pvh (1U<<_DOMCRF_pvh) /* * rcu_lock_domain_by_id() is more efficient than get_domain_by_id(). @@ -732,8 +743,12 @@ void watchdog_domain_destroy(struct domain *d); #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist)) -#define is_hvm_domain(d) ((d)->is_hvm) +#define is_pv_domain(d) ((d)->guest_type == guest_type_pv) +#define is_pv_vcpu(v) (is_pv_domain((v)->domain)) +#define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm) #define is_hvm_vcpu(v) (is_hvm_domain(v->domain)) +#define is_pvh_domain(d) ((d)->guest_type == guest_type_pvh) +#define is_pvh_vcpu(v) (is_pvh_domain((v)->domain)) #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \ cpumask_weight((v)->cpu_affinity) == 1) #ifdef HAS_PASSTHROUGH -- 1.7.2.3
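The effect of the three-way guest type in patch 08 can be tried out in isolation with the sketch below. The enum and the is_*_domain macros follow the diff; everything else (the cut-down struct domain and struct vcpu, the is_32bit_pv, tf_kernel_mode and ss_dpl fields, and the function form of guest_kernel_mode) is a simplification assumed for illustration, not the hypervisor's actual layout.

```c
#include <assert.h>
#include <stdbool.h>

enum guest_type { guest_type_pv, guest_type_pvh, guest_type_hvm };

/* Cut-down stand-ins for the real structures (assumed fields). */
struct domain {
    enum guest_type guest_type;
    bool is_32bit_pv;          /* stands in for is_pv_32bit_domain() */
};

struct vcpu {
    struct domain *domain;
    bool tf_kernel_mode;       /* stands in for arch.flags & TF_kernel_mode */
    unsigned int ss_dpl;       /* stands in for the SS DPL read from the VMCS */
};

#define is_pv_domain(d)  ((d)->guest_type == guest_type_pv)
#define is_pvh_domain(d) ((d)->guest_type == guest_type_pvh)
#define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm)

/* Paths shared by HVM and PVH now test "not PV" rather than "is HVM". */
static bool uses_hvm_container(const struct domain *d)
{
    return !is_pv_domain(d);
}

/* Function form of the guest_kernel_mode() decision in the patch. */
static bool guest_kernel_mode(const struct vcpu *v, unsigned int cs)
{
    unsigned int ring = cs & 3;

    if ( is_pvh_domain(v->domain) )
        return v->ss_dpl == 0;               /* kernel runs in ring 0 */
    if ( !v->domain->is_32bit_pv )
        return ring == 3 && v->tf_kernel_mode; /* 64-bit PV kernel in ring 3 */
    return ring == 1;                         /* 32-bit PV kernel in ring 1 */
}
```

Note that a PVH domain is neither is_pv nor is_hvm, which is why the copy macros in the diff switch from is_hvm_vcpu(current) to !is_pv_vcpu(current).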
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
vmx_pvh_set_vcpu_info() is added to a new file pvh.c, to which more changes are added later, like pvh vmexit handler. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/vmx/Makefile | 1 + xen/arch/x86/hvm/vmx/pvh.c | 78 +++++++++++++++++++++++++++++++++++++ xen/arch/x86/hvm/vmx/vmx.c | 1 + xen/include/asm-x86/hvm/hvm.h | 8 ++++ xen/include/asm-x86/hvm/vmx/vmx.h | 1 + 5 files changed, 89 insertions(+), 0 deletions(-) create mode 100644 xen/arch/x86/hvm/vmx/pvh.c diff --git a/xen/arch/x86/hvm/vmx/Makefile b/xen/arch/x86/hvm/vmx/Makefile index 373b3d9..59fb5d4 100644 --- a/xen/arch/x86/hvm/vmx/Makefile +++ b/xen/arch/x86/hvm/vmx/Makefile @@ -1,5 +1,6 @@ obj-bin-y += entry.o obj-y += intr.o +obj-y += pvh.o obj-y += realmode.o obj-y += vmcs.o obj-y += vmx.o diff --git a/xen/arch/x86/hvm/vmx/pvh.c b/xen/arch/x86/hvm/vmx/pvh.c new file mode 100644 index 0000000..b37e423 --- /dev/null +++ b/xen/arch/x86/hvm/vmx/pvh.c @@ -0,0 +1,78 @@ +/* + * Copyright (C) 2013, Mukesh Rathor, Oracle Corp. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ + +#include <xen/hypercall.h> +#include <xen/guest_access.h> +#include <asm/p2m.h> +#include <asm/traps.h> +#include <asm/hvm/vmx/vmx.h> +#include <public/sched.h> +#include <asm/hvm/nestedhvm.h> +#include <asm/xstate.h> + +/* + * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called + * from arch_set_info_guest() which sets the (PVH relevant) non-vmcs fields. + * + * In case of linux: + * The boot vcpu calls this to set some context for the non boot smp vcpu. 
+ * The call comes from cpu_initialize_context(). (boot vcpu 0 context is + * set by the tools via do_domctl -> vcpu_initialise). + * + * NOTE: In case of VMCS, loading a selector doesn't cause the hidden fields + * to be automatically loaded. We load selectors here but not the hidden + * parts, except for GS_BASE and FS_BASE. This means we require the + * guest to have same hidden values as the default values loaded in the + * vmcs in pvh_construct_vmcs(), ie, the GDT the vcpu is coming up on + * should be something like following, + * (from 64bit linux, CS:0x10 DS/SS:0x18) : + * + * ffff88007f704000: 0000000000000000 00cf9b000000ffff + * ffff88007f704010: 00af9b000000ffff 00cf93000000ffff + * ffff88007f704020: 00cffb000000ffff 00cff3000000ffff + * + */ +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp) +{ + if ( v->vcpu_id == 0 ) + return 0; + + if ( !(ctxtp->flags & VGCF_in_kernel) ) + return -EINVAL; + + vmx_vmcs_enter(v); + __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr); + __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit); + __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base); + __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents); + + __vmwrite(GUEST_FS_BASE, ctxtp->fs_base); + __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_kernel); + + __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs); + __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss); + __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es); + __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds); + __vmwrite(GUEST_FS_SELECTOR, ctxtp->user_regs.fs); + __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs); + + if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) ) + { + vmx_vmcs_exit(v); + return -EINVAL; + } + vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_user); + + vmx_vmcs_exit(v); + return 0; +} diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 7292357..e3c7515 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1562,6 +1562,7 @@ static struct hvm_function_table
__initdata vmx_function_table = { .sync_pir_to_irr = vmx_sync_pir_to_irr, .handle_eoi = vmx_handle_eoi, .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m, + .pvh_set_vcpu_info = vmx_pvh_set_vcpu_info, }; const struct hvm_function_table * __init start_vmx(void) diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index 00489cf..072a2a7 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -193,6 +193,8 @@ struct hvm_function_table { paddr_t *L1_gpa, unsigned int *page_order, uint8_t *p2m_acc, bool_t access_r, bool_t access_w, bool_t access_x); + + int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp); }; extern struct hvm_function_table hvm_funcs; @@ -326,6 +328,12 @@ static inline unsigned long hvm_get_shadow_gs_base(struct vcpu *v) return hvm_funcs.get_shadow_gs_base(v); } +static inline int pvh_set_vcpu_info(struct vcpu *v, + struct vcpu_guest_context *ctxtp) +{ + return hvm_funcs.pvh_set_vcpu_info(v, ctxtp); +} + #define is_viridian_domain(_d) \ (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN])) diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h index c21a303..9e6c481 100644 --- a/xen/include/asm-x86/hvm/vmx/vmx.h +++ b/xen/include/asm-x86/hvm/vmx/vmx.h @@ -473,6 +473,7 @@ void vmx_update_guest_eip(void); void vmx_dr_access(unsigned long exit_qualification, struct cpu_user_regs *regs); void vmx_fpu_enter(struct vcpu *v); +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp); int alloc_p2m_hap_data(struct p2m_domain *p2m); void free_p2m_hap_data(struct p2m_domain *p2m); -- 1.7.2.3
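The comment in pvh.c above says the guest GDT must carry the same hidden values as the VMCS defaults, and quotes six descriptors from a 64bit linux GDT. As an aid to reading those hex values, here is a small decoder sketch. The bit positions are the architectural x86 descriptor layout; struct seg_desc, decode_descriptor() and example_gdt are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one 8-byte segment descriptor (illustrative type). */
struct seg_desc {
    uint32_t base, limit;
    unsigned int type, s, dpl, p, l, db, g;
};

/* The six GDT entries quoted in the pvh.c comment, in order. */
static const uint64_t example_gdt[6] = {
    0x0000000000000000ULL, 0x00cf9b000000ffffULL,
    0x00af9b000000ffffULL, 0x00cf93000000ffffULL,
    0x00cffb000000ffffULL, 0x00cff3000000ffffULL,
};

static struct seg_desc decode_descriptor(uint64_t d)
{
    struct seg_desc s;

    s.limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 32) & 0xf0000);
    s.base  = (uint32_t)((d >> 16) & 0xffffff) |
              ((uint32_t)((d >> 56) & 0xff) << 24);
    s.type  = (unsigned int)((d >> 40) & 0xf);
    s.s     = (unsigned int)((d >> 44) & 1);
    s.dpl   = (unsigned int)((d >> 45) & 3);
    s.p     = (unsigned int)((d >> 47) & 1);
    s.l     = (unsigned int)((d >> 53) & 1);  /* 64-bit code segment */
    s.db    = (unsigned int)((d >> 54) & 1);
    s.g     = (unsigned int)((d >> 55) & 1);
    return s;
}

/* A selector indexes the table in units of 8 bytes (TI/RPL bits dropped). */
static struct seg_desc lookup(unsigned int sel)
{
    return decode_descriptor(example_gdt[sel >> 3]);
}
```

Decoding selector 0x10 gives a present, DPL 0, long-mode code segment, and 0x18 gives a present, DPL 0, writable data segment, matching the "CS:0x10 DS/SS:0x18" note in the comment.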
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
This patch mostly contains changes to arch/x86/domain.c to allow for a PVH domain creation. The new function pvh_set_vcpu_info(), introduced in the previous patch, is called here to set some guest context in the VMCS. This patch also changes the context_switch code in the same file to follow HVM behaviour for PVH. Changes in V2: - changes to read_segment_register() moved to this patch. - The other comment was to create NULL functions for pvh_set_vcpu_info and pvh_read_descriptor which are implemented in later patch, but since I disable PVH creation until all patches are checked in, it is not needed. But it helps breaking down of patches. Changes in V3: - Fix read_segment_register() macro to make sure args are evaluated once, and use # instead of STR for name in the macro. Changes in V4: - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been moved out of pv_vcpu struct. - rename hvm_pvh_* functions to hvm_*. Changes in V5: - remove pvh_read_descriptor(). Changes in V7: - remove hap_update_cr3() and read_segment_register changes from here. 
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/domain.c | 56 ++++++++++++++++++++++++++++++++---------------- xen/arch/x86/mm.c | 3 ++ 2 files changed, 40 insertions(+), 19 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index c361abf..fccb4ee 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v) vmce_init_vcpu(v); - if ( is_hvm_domain(d) ) + if ( !is_pv_domain(d) ) { rc = hvm_vcpu_initialise(v); goto done; @@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v) vcpu_destroy_fpu(v); - if ( is_hvm_vcpu(v) ) + if ( !is_pv_vcpu(v) ) hvm_vcpu_destroy(v); else xfree(v->arch.pv_vcpu.trap_ctxt); @@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) int rc = -ENOMEM; d->arch.hvm_domain.hap_enabled - is_hvm_domain(d) && + !is_pv_domain(d) && hvm_funcs.hap_supported && (domcr_flags & DOMCRF_hap); d->arch.hvm_domain.mem_sharing_enabled = 0; @@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) mapcache_domain_init(d); HYPERVISOR_COMPAT_VIRT_START(d) - is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START; + is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u; if ( (rc = paging_domain_init(d, domcr_flags)) != 0 ) goto fail; @@ -555,7 +555,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) } spin_lock_init(&d->arch.e820_lock); - if ( is_hvm_domain(d) ) + if ( !is_pv_domain(d) ) { if ( (rc = hvm_domain_initialise(d)) != 0 ) { @@ -651,7 +651,7 @@ int arch_set_info_guest( #define c(fld) (compat ? 
(c.cmp->fld) : (c.nat->fld)) flags = c(flags); - if ( !is_hvm_vcpu(v) ) + if ( is_pv_vcpu(v) ) { if ( !compat ) { @@ -704,7 +704,7 @@ int arch_set_info_guest( v->fpu_initialised = !!(flags & VGCF_I387_VALID); v->arch.flags &= ~TF_kernel_mode; - if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ ) + if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ ) v->arch.flags |= TF_kernel_mode; v->arch.vgc_flags = flags; @@ -719,7 +719,7 @@ int arch_set_info_guest( if ( !compat ) { memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs)); - if ( !is_hvm_vcpu(v) ) + if ( is_pv_vcpu(v) ) memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt, sizeof(c.nat->trap_ctxt)); } @@ -735,10 +735,13 @@ int arch_set_info_guest( v->arch.user_regs.eflags |= 2; - if ( is_hvm_vcpu(v) ) + if ( !is_pv_vcpu(v) ) { hvm_set_info_guest(v); - goto out; + if ( is_hvm_vcpu(v) || v->is_initialised ) + goto out; + else + goto pvh_skip_pv_stuff; } init_int80_direct_trap(v); @@ -853,6 +856,7 @@ int arch_set_info_guest( set_bit(_VPF_in_reset, &v->pause_flags); + pvh_skip_pv_stuff: if ( !compat ) cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]); else @@ -861,7 +865,7 @@ int arch_set_info_guest( if ( !cr3_page ) rc = -EINVAL; - else if ( paging_mode_refcounts(d) ) + else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) ) /* nothing */; else if ( cr3_page == v->arch.old_guest_table ) { @@ -893,8 +897,15 @@ int arch_set_info_guest( /* handled below */; else if ( !compat ) { + /* PVH 32bitfixme. */ + if ( is_pvh_vcpu(v) ) + { + v->arch.cr3 = page_to_mfn(cr3_page); + v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3]; + } + v->arch.guest_table = pagetable_from_page(cr3_page); - if ( c.nat->ctrlreg[1] ) + if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) ) { cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]); cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC); @@ -954,6 +965,13 @@ int arch_set_info_guest( update_cr3(v); + if ( is_pvh_vcpu(v) ) + { + /* Set VMCS fields. 
*/ + if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 ) + return rc; + } + out: if ( flags & VGCF_online ) clear_bit(_VPF_down, &v->pause_flags); @@ -1315,7 +1333,7 @@ static void update_runstate_area(struct vcpu *v) static inline int need_full_gdt(struct vcpu *v) { - return (!is_hvm_vcpu(v) && !is_idle_vcpu(v)); + return (is_pv_vcpu(v) && !is_idle_vcpu(v)); } static void __context_switch(void) @@ -1450,7 +1468,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next) /* Re-enable interrupts before restoring state which may fault. */ local_irq_enable(); - if ( !is_hvm_vcpu(next) ) + if ( is_pv_vcpu(next) ) { load_LDT(next); load_segments(next); @@ -1576,12 +1594,12 @@ unsigned long hypercall_create_continuation( regs->eax = op; /* Ensure the hypercall trap instruction is re-executed. */ - if ( !is_hvm_vcpu(current) ) + if ( is_pv_vcpu(current) ) regs->eip -= 2; /* re-execute 'syscall' / 'int $xx' */ else current->arch.hvm_vcpu.hcall_preempted = 1; - if ( !is_hvm_vcpu(current) ? + if ( is_pv_vcpu(current) ?
!is_pv_32on64_vcpu(current) : (hvm_guest_x86_mode(current) == 8) ) { @@ -1849,7 +1867,7 @@ int domain_relinquish_resources(struct domain *d) return ret; } - if ( !is_hvm_domain(d) ) + if ( is_pv_domain(d) ) { for_each_vcpu ( d, v ) { @@ -1922,7 +1940,7 @@ int domain_relinquish_resources(struct domain *d) BUG(); } - if ( is_hvm_domain(d) ) + if ( !is_pv_domain(d) ) hvm_domain_relinquish_resources(d); return 0; @@ -2006,7 +2024,7 @@ void vcpu_mark_events_pending(struct vcpu *v) if ( already_pending ) return; - if ( is_hvm_vcpu(v) ) + if ( !is_pv_vcpu(v) ) hvm_assert_evtchn_irq(v); else vcpu_kick(v); diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index 412971e..ece11e4 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v) int i; unsigned long pfn; + if ( is_pvh_vcpu(v) ) + return; + v->arch.pv_vcpu.gdt_ents = 0; pl1e = gdt_ldt_ptes(v->domain, v); for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ ) -- 1.7.2.3
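The branch added to arch_set_info_guest() in patch 10 is easy to lose in the flattened diff, so here is the decision on its own as a rough sketch. The enum set_info_path and the function set_info_path() are invented for illustration; only the conditions are taken from the hunk that introduces the pvh_skip_pv_stuff label.

```c
#include <assert.h>
#include <stdbool.h>

enum guest_type { guest_type_pv, guest_type_pvh, guest_type_hvm };

/* Which part of arch_set_info_guest() a vcpu ends up in (illustrative). */
enum set_info_path {
    PATH_PV_FULL,       /* PV: trap table, GDT, page tables, and so on */
    PATH_HVM_OUT,       /* HVM, or PVH already initialised: goto out */
    PATH_PVH_CR3_VMCS   /* PVH first init: skip PV setup, set cr3, VMCS */
};

static enum set_info_path set_info_path(enum guest_type t,
                                        bool is_initialised)
{
    if ( t == guest_type_pv )
        return PATH_PV_FULL;

    /* hvm_set_info_guest() has run at this point for HVM and PVH. */
    if ( t == guest_type_hvm || is_initialised )
        return PATH_HVM_OUT;

    /* Corresponds to "goto pvh_skip_pv_stuff" in the patch. */
    return PATH_PVH_CR3_VMCS;
}
```

On the last path the patch records the guest cr3, calls update_cr3(), and then pvh_set_vcpu_info() to fill in the VMCS fields.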
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
This patch supports invalid op emulation for PVH by calling appropriate copy macros and an HVM function to inject PF. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/traps.c | 17 ++++++++++++++--- xen/include/asm-x86/traps.h | 1 + 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index 378ef0a..a3ca70b 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -459,6 +459,11 @@ static void instruction_done( struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch) { regs->eip = eip; + + /* PVH fixme: debug trap below */ + if ( is_pvh_vcpu(current) ) + return; + regs->eflags &= ~X86_EFLAGS_RF; if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) ) { @@ -913,7 +918,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs) return EXCRET_fault_fixed; } -static int emulate_forced_invalid_op(struct cpu_user_regs *regs) +int emulate_forced_invalid_op(struct cpu_user_regs *regs) { char sig[5], instr[2]; unsigned long eip, rc; @@ -921,7 +926,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs) eip = regs->eip; /* Check for forced emulation signature: ud2 ; .ascii "xen". */ - if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 ) + if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 ) { propagate_page_fault(eip + sizeof(sig) - rc, 0); return EXCRET_fault_fixed; @@ -931,7 +936,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs) eip += sizeof(sig); /* We only emulate CPUID.
*/ - if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 ) + if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 ) { propagate_page_fault(eip + sizeof(instr) - rc, 0); return EXCRET_fault_fixed; @@ -1076,6 +1081,12 @@ void propagate_page_fault(unsigned long addr, u16 error_code) struct vcpu *v = current; struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce; + if ( is_pvh_vcpu(v) ) + { + hvm_inject_page_fault(error_code, addr); + return; + } + v->arch.pv_vcpu.ctrlreg[2] = addr; arch_set_cr2(v, addr); diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h index 82cbcee..1d9b087 100644 --- a/xen/include/asm-x86/traps.h +++ b/xen/include/asm-x86/traps.h @@ -48,5 +48,6 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid, */ extern int send_guest_trap(struct domain *d, uint16_t vcpuid, unsigned int trap_nr); +int emulate_forced_invalid_op(struct cpu_user_regs *regs); #endif /* ASM_TRAP_H */ -- 1.7.2.3
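The byte pattern that emulate_forced_invalid_op() looks for ("ud2 ; .ascii "xen"", then CPUID) can be checked against a plain buffer as in the sketch below. This is only an illustration of the signature match: match_forced_cpuid() is a made-up helper, it works on a local byte array rather than guest memory, and it leaves out the page fault injection that the patch routes through hvm_inject_page_fault() for PVH.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ud2 is 0f 0b; it is followed by the ASCII bytes 'x' 'e' 'n'. */
static const unsigned char forced_sig[5] = { 0x0f, 0x0b, 'x', 'e', 'n' };

/* cpuid is 0f a2; the comment in the patch says only CPUID is emulated. */
static const unsigned char cpuid_op[2] = { 0x0f, 0xa2 };

/*
 * Return the number of bytes covered by a forced-emulation CPUID at the
 * start of code, or 0 if the bytes do not match.
 */
static size_t match_forced_cpuid(const unsigned char *code, size_t len)
{
    if ( len < sizeof(forced_sig) + sizeof(cpuid_op) )
        return 0;
    if ( memcmp(code, forced_sig, sizeof(forced_sig)) != 0 )
        return 0;
    if ( memcmp(code + sizeof(forced_sig), cpuid_op, sizeof(cpuid_op)) != 0 )
        return 0;
    return sizeof(forced_sig) + sizeof(cpuid_op);
}
```

In the hypervisor the two reads go through raw_copy_from_guest(), which is what lets the same function serve PV and PVH callers.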
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
This patch changes mostly traps.c to support privileged op emulation for PVH. A new function read_descriptor_sel() is introduced to read descriptor for PVH given a selector. Another new function vmx_read_selector() reads a selector from VMCS, to support read_segment_register() for PVH. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- xen/arch/x86/hvm/vmx/vmx.c | 40 +++++++++++++++++++ xen/arch/x86/traps.c | 86 +++++++++++++++++++++++++++++++++++----- xen/include/asm-x86/hvm/hvm.h | 7 +++ xen/include/asm-x86/system.h | 19 +++++++-- 4 files changed, 137 insertions(+), 15 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index e3c7515..80109c1 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -664,6 +664,45 @@ static void vmx_ctxt_switch_to(struct vcpu *v) .fields = { .type = 0xb, .s = 0, .dpl = 0, .p = 1, .avl = 0, \ .l = 0, .db = 0, .g = 0, .pad = 0 } }).bytes) +u16 vmx_read_selector(struct vcpu *v, enum x86_segment seg) +{ + u16 sel = 0; + + vmx_vmcs_enter(v); + switch ( seg ) + { + case x86_seg_cs: + sel = __vmread(GUEST_CS_SELECTOR); + break; + + case x86_seg_ss: + sel = __vmread(GUEST_SS_SELECTOR); + break; + + case x86_seg_es: + sel = __vmread(GUEST_ES_SELECTOR); + break; + + case x86_seg_ds: + sel = __vmread(GUEST_DS_SELECTOR); + break; + + case x86_seg_fs: + sel = __vmread(GUEST_FS_SELECTOR); + break; + + case x86_seg_gs: + sel = __vmread(GUEST_GS_SELECTOR); + break; + + default: + BUG(); + } + vmx_vmcs_exit(v); + + return sel; +} + void vmx_get_segment_register(struct vcpu *v, enum x86_segment seg, struct segment_register *reg) { @@ -1563,6 +1602,7 @@ static struct hvm_function_table __initdata vmx_function_table = { .handle_eoi = vmx_handle_eoi, .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m, .pvh_set_vcpu_info = vmx_pvh_set_vcpu_info, + .read_selector = vmx_read_selector, }; const struct hvm_function_table * __init start_vmx(void) diff --git 
a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index a3ca70b..fe8b94c 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -480,6 +480,10 @@ static unsigned int check_guest_io_breakpoint(struct vcpu *v, unsigned int width, i, match = 0; unsigned long start; + /* PVH fixme: support io breakpoint. */ + if ( is_pvh_vcpu(v) ) + return 0; + if ( !(v->arch.debugreg[5]) || !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) ) return 0; @@ -1525,6 +1529,49 @@ static int read_descriptor(unsigned int sel, return 1; } +static int read_descriptor_sel(unsigned int sel, + enum x86_segment which_sel, + struct vcpu *v, + const struct cpu_user_regs *regs, + unsigned long *base, + unsigned long *limit, + unsigned int *ar, + unsigned int vm86attr) +{ + struct segment_register seg; + bool_t long_mode; + + if ( !is_pvh_vcpu(v) ) + return read_descriptor(sel, v, regs, base, limit, ar, vm86attr); + + hvm_get_segment_register(v, x86_seg_cs, &seg); + long_mode = seg.attr.fields.l; + + if ( which_sel != x86_seg_cs ) + hvm_get_segment_register(v, which_sel, &seg); + + /* "ar" is returned packed as in segment_attributes_t. Fix it up. */ + *ar = seg.attr.bytes; + *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4); + *ar <<= 8; + + if ( long_mode ) + { + *limit = ~0UL; + + if ( which_sel < x86_seg_fs ) + { + *base = 0UL; + return 1; + } + } + else + *limit = seg.limit; + + *base = seg.base; + return 1; +} + static int read_gate_descriptor(unsigned int gate_sel, const struct vcpu *v, unsigned int *sel, @@ -1590,6 +1637,13 @@ static int guest_io_okay( int user_mode = !(v->arch.flags & TF_kernel_mode); #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v) + /* + * For PVH we check this in vmexit for EXIT_REASON_IO_INSTRUCTION + * and so don't need to check again here. + */ + if ( is_pvh_vcpu(v) ) + return 1; + if ( !vm86_mode(regs) && (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ?
1 : 3)) ) return 1; @@ -1835,7 +1889,7 @@ static inline uint64_t guest_misc_enable(uint64_t val) _ptr = (unsigned int)_ptr; \ if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) ) \ goto fail; \ - if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \ + if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \ { \ propagate_page_fault(_ptr + sizeof(_x) - _rc, 0); \ goto skip; \ @@ -1852,6 +1906,7 @@ static int is_cpufreq_controller(struct domain *d) static int emulate_privileged_op(struct cpu_user_regs *regs) { + enum x86_segment which_sel; struct vcpu *v = current; unsigned long *reg, eip = regs->eip; u8 opcode, modrm_reg = 0, modrm_rm = 0, rep_prefix = 0, lock = 0, rex = 0; @@ -1874,9 +1929,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1))); uint64_t val, msr_content; - if ( !read_descriptor(regs->cs, v, regs, - &code_base, &code_limit, &ar, - _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) ) + if ( !read_descriptor_sel(regs->cs, x86_seg_cs, v, regs, + &code_base, &code_limit, &ar, + _SEGMENT_CODE|_SEGMENT_S| + _SEGMENT_DPL|_SEGMENT_P) ) goto fail; op_default = op_bytes = (ar & (_SEGMENT_L|_SEGMENT_DB)) ? 4 : 2; ad_default = ad_bytes = (ar & _SEGMENT_L) ? 8 : op_default; @@ -1887,6 +1943,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) /* emulating only opcodes not allowing SS to be default */ data_sel = read_segment_register(v, regs, ds); + which_sel = x86_seg_ds; /* Legacy prefixes. 
*/ for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) ) @@ -1902,23 +1959,29 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) continue; case 0x2e: /* CS override */ data_sel = regs->cs; + which_sel = x86_seg_cs; continue; case 0x3e: /* DS override */ data_sel = read_segment_register(v, regs, ds); + which_sel = x86_seg_ds; continue; case 0x26: /* ES override */ data_sel = read_segment_register(v, regs, es); + which_sel = x86_seg_es; continue; case 0x64: /* FS override */ data_sel = read_segment_register(v, regs, fs); + which_sel = x86_seg_fs; lm_ovr = lm_seg_fs; continue; case 0x65: /* GS override */ data_sel = read_segment_register(v, regs, gs); + which_sel = x86_seg_gs; lm_ovr = lm_seg_gs; continue; case 0x36: /* SS override */ data_sel = regs->ss; + which_sel = x86_seg_ss; continue; case 0xf0: /* LOCK */ lock = 1; @@ -1962,15 +2025,16 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) if ( !(opcode & 2) ) { data_sel = read_segment_register(v, regs, es); + which_sel = x86_seg_es; lm_ovr = lm_seg_none; } if ( !(ar & _SEGMENT_L) ) { - if ( !read_descriptor(data_sel, v, regs, - &data_base, &data_limit, &ar, - _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL| - _SEGMENT_P) ) + if ( !read_descriptor_sel(data_sel, which_sel, v, regs, + &data_base, &data_limit, &ar, + _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL| + _SEGMENT_P) ) goto fail; if ( !(ar & _SEGMENT_S) || !(ar & _SEGMENT_P) || @@ -2000,9 +2064,9 @@ static int emulate_privileged_op(struct cpu_user_regs *regs) } } else - read_descriptor(data_sel, v, regs, - &data_base, &data_limit, &ar, - 0); + read_descriptor_sel(data_sel, which_sel, v, regs, + &data_base, &data_limit, &ar, + 0); data_limit = ~0UL; ar = _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P; } diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index 072a2a7..29ed313 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -195,6 +195,8 @@ struct hvm_function_table { bool_t access_w, bool_t 
access_x); int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp); + + u16 (*read_selector)(struct vcpu *v, enum x86_segment seg); }; extern struct hvm_function_table hvm_funcs; @@ -334,6 +336,11 @@ static inline int pvh_set_vcpu_info(struct vcpu *v, return hvm_funcs.pvh_set_vcpu_info(v, ctxtp); } +static inline u16 pvh_get_selector(struct vcpu *v, enum x86_segment seg) +{ + return hvm_funcs.read_selector(v, seg); +} + #define is_viridian_domain(_d) \ (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN])) diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h index 9bb22cb..1242657 100644 --- a/xen/include/asm-x86/system.h +++ b/xen/include/asm-x86/system.h @@ -4,10 +4,21 @@ #include <xen/lib.h> #include <xen/bitops.h> -#define read_segment_register(vcpu, regs, name) \ -({ u16 __sel; \ - asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) ); \ - __sel; \ +/* + * We need vcpu because during context switch, going from PV to PVH, + * in save_segments() current has been updated to next, and no longer pointing + * to the PV, but the intention is to get selector for the PV. Checking + * is_pvh_vcpu(current) will yield incorrect results in such a case. + */ +#define read_segment_register(vcpu, regs, name) \ +({ u16 __sel; \ + struct cpu_user_regs *_regs = (regs); \ + \ + if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) ) \ + __sel = pvh_get_selector(vcpu, x86_seg_##name); \ + else \ + asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) ); \ + __sel; \ }) #define wbinvd() \ -- 1.7.2.3
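For anyone reading along, the read_segment_register() change above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: every toy_* name is hypothetical and not part of Xen, toy_live_sel stands in for the CPU's live segment registers, and vmcs_sel stands in for the selectors that the VMX read_selector hook would fetch from the VMCS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for enum x86_segment. */
enum toy_seg { TOY_SEG_CS, TOY_SEG_SS, TOY_SEG_DS, TOY_SEG_ES,
               TOY_SEG_FS, TOY_SEG_GS, TOY_SEG_NR };

struct toy_vcpu {
    int is_pvh;                     /* models is_pvh_vcpu(v) */
    uint16_t vmcs_sel[TOY_SEG_NR];  /* models selectors saved in the VMCS */
};

/* Models the new read_selector member of hvm_function_table. */
struct toy_hvm_ops {
    uint16_t (*read_selector)(const struct toy_vcpu *v, enum toy_seg seg);
};

static uint16_t toy_vmx_read_selector(const struct toy_vcpu *v,
                                      enum toy_seg seg)
{
    return v->vmcs_sel[seg];
}

static const struct toy_hvm_ops toy_ops = {
    .read_selector = toy_vmx_read_selector,
};

/* Models the live CPU segment registers (the "movw %%seg" path). */
static uint16_t toy_live_sel[TOY_SEG_NR];

/*
 * Models read_segment_register(): the vcpu is passed in explicitly, so
 * the decision does not depend on whoever "current" happens to be.
 */
static uint16_t toy_read_segment(const struct toy_vcpu *v, int guest_mode,
                                 enum toy_seg seg)
{
    assert(seg < TOY_SEG_NR);
    if ( v->is_pvh && guest_mode )
        return toy_ops.read_selector(v, seg);
    return toy_live_sel[seg];
}
```

The explicit vcpu argument is the point of the comment in system.h: during a PV to PVH context switch, checking "current" would pick the wrong branch.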
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 13/23] PVH xen: interrupt/event-channel delivery to PVH
PVH uses HVMIRQ_callback_vector for interrupt delivery. Also, change
hvm_vcpu_has_pending_irq() as PVH doesn't need to use vlapic emulation,
so we can skip vlapic checks in the function. Moreover, a PVH guest
installs IDT natively, and sets a callback vector for interrupt delivery
during boot. Once that is done, it receives interrupts via the callback.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/hvm/irq.c       |    3 +++
 xen/arch/x86/hvm/vmx/intr.c  |    8 ++++++--
 xen/include/asm-x86/domain.h |    2 +-
 xen/include/asm-x86/event.h  |    2 +-
 4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 9eae5de..92fb245 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
          && vcpu_info(v, evtchn_upcall_pending) )
         return hvm_intack_vector(plat->irq.callback_via.vector);
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_intack_none;
+
     if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
         return hvm_intack_pic(0);
 
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index e376f3c..ce42950 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -165,6 +165,9 @@ static int nvmx_intr_intercept(struct vcpu *v, struct hvm_intack intack)
 {
     u32 ctrl;
 
+    if ( is_pvh_vcpu(v) )
+        return 0;
+
     if ( nvmx_intr_blocked(v) != hvm_intblk_none )
     {
         enable_intr_window(v, intack);
@@ -219,8 +222,9 @@ void vmx_intr_assist(void)
         return;
     }
 
-    /* Crank the handle on interrupt state. */
-    pt_vector = pt_update_irq(v);
+    if ( !is_pvh_vcpu(v) )
+        /* Crank the handle on interrupt state. */
+        pt_vector = pt_update_irq(v);
 
     do {
         intack = hvm_vcpu_has_pending_irq(v);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 22a72df..21a9954 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -16,7 +16,7 @@
 #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d))
 #define is_pv_32on64_vcpu(v) (is_pv_32on64_domain((v)->domain))
 
-#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \
+#define is_hvm_pv_evtchn_domain(d) (!is_pv_domain(d) && \
         d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector)
 #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain))
 
diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index 06057c7..7ed5812 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -18,7 +18,7 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
-    return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) :
+    return (!is_pv_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
 }
-- 
1.7.2.3
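To make the ordering in hvm_vcpu_has_pending_irq() easier to see, here is a small self-contained model of it. It is a sketch only: the toy_* types and fields are hypothetical, and the NMI/MCE checks that precede this code in the real function are left out on purpose.

```c
#include <assert.h>
#include <stdint.h>

enum toy_intsrc { TOY_INTSRC_NONE, TOY_INTSRC_VECTOR,
                  TOY_INTSRC_PIC, TOY_INTSRC_LAPIC };

struct toy_intack {
    enum toy_intsrc source;
    uint8_t vector;
};

/* Hypothetical flattening of the per-vcpu and per-domain IRQ state. */
struct toy_vcpu {
    int is_pvh;                /* models is_pvh_vcpu(v) */
    int uses_callback_vector;  /* models HVMIRQ_callback_vector being set */
    uint8_t callback_vector;
    int upcall_pending;        /* models evtchn_upcall_pending */
    int pic_output;            /* emulated PIC has an interrupt */
    int lapic_vector;          /* emulated LAPIC vector, -1 if none */
};

static struct toy_intack toy_pending_irq(const struct toy_vcpu *v)
{
    struct toy_intack none = { TOY_INTSRC_NONE, 0 };

    /* Event-channel upcall via the callback vector is checked first. */
    if ( v->uses_callback_vector && v->upcall_pending )
        return (struct toy_intack){ TOY_INTSRC_VECTOR, v->callback_vector };

    /* PVH has no emulated PIC or LAPIC, so nothing else can be pending. */
    if ( v->is_pvh )
        return none;

    if ( v->pic_output )
        return (struct toy_intack){ TOY_INTSRC_PIC, 0 };

    if ( v->lapic_vector >= 0 )
        return (struct toy_intack){ TOY_INTSRC_LAPIC,
                                    (uint8_t)v->lapic_vector };

    return none;
}
```

With this ordering a PVH vcpu only ever sees the callback vector, even if stale emulated-device state is present in the structure.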
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
Fail creation of a 32bit PVH guest. Change hap_paging_get_mode() to
return long mode for PVH; this is reached during domain creation from
arch_set_info_guest(). Return the correct features for PVH to the guest
during its boot.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/domain.c     |    8 ++++++++
 xen/arch/x86/mm/hap/hap.c |    4 +++-
 xen/common/domain.c       |   10 ++++++++++
 xen/common/domctl.c       |    5 +++++
 xen/common/kernel.c       |    6 +++++-
 5 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fccb4ee..288872a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -339,6 +339,14 @@ int switch_compat(struct domain *d)
 
     if ( d == NULL )
         return -EINVAL;
+
+    if ( is_pvh_domain(d) )
+    {
+        printk(XENLOG_INFO
+               "Xen currently does not support 32bit PVH guests\n");
+        return -EINVAL;
+    }
+
     if ( !may_switch_mode(d) )
         return -EACCES;
     if ( is_pv_32on64_domain(d) )
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index bff05d9..19a085c 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
 const struct paging_mode *
 hap_paging_get_mode(struct vcpu *v)
 {
-    return !hvm_paging_enabled(v) ? &hap_paging_real_mode :
+    /* PVH 32bitfixme. */
+    return is_pvh_vcpu(v) ? &hap_paging_long_mode :
+        !hvm_paging_enabled(v) ? &hap_paging_real_mode :
         hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
         hvm_pae_enabled(v) ? &hap_paging_pae_mode :
                                &hap_paging_protected_mode;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 38b1bad..3b4af4b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -237,6 +237,16 @@ struct domain *domain_create(
 
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = guest_type_hvm;
+    else if ( domcr_flags & DOMCRF_pvh )
+    {
+        if ( !(domcr_flags & DOMCRF_hap) )
+        {
+            err = -EOPNOTSUPP;
+            printk(XENLOG_INFO "PVH guest must have HAP on\n");
+            goto fail;
+        }
+        d->guest_type = guest_type_pvh;
+    }
 
     if ( domid == 0 )
     {
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index c653efb..48e4c08 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -187,6 +187,8 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
 
     if ( is_hvm_domain(d) )
         info->flags |= XEN_DOMINF_hvm_guest;
+    else if ( is_pvh_domain(d) )
+        info->flags |= XEN_DOMINF_pvh_guest;
 
     xsm_security_domaininfo(d, info);
 
@@ -443,6 +445,9 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         domcr_flags = 0;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
             domcr_flags |= DOMCRF_hvm;
+        else if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
+            domcr_flags |= DOMCRF_pvh;     /* PV with HAP is a PVH guest */
+
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
             domcr_flags |= DOMCRF_hap;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 72fb905..3bba758 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -289,7 +289,11 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             if ( current->domain == dom0 )
                 fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-            if ( !is_hvm_vcpu(current) )
+            if ( is_pvh_vcpu(current) )
+                fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+                             (1U << XENFEAT_supervisor_mode_kernel) |
+                             (1U << XENFEAT_hvm_callback_vector);
+            else if ( !is_hvm_vcpu(current) )
                 fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
                              (1U << XENFEAT_highmem_assist) |
                              (1U << XENFEAT_gnttab_map_avail_bits);
-- 
1.7.2.3
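The do_domctl() and domain_create() hunks above together amount to a small classification rule: the HVM flag wins, otherwise HAP on a PV-style domain means PVH. A rough standalone sketch of that rule follows; the TOY_* flag values and names are illustrative assumptions, not the real XEN_DOMCTL_CDF_* or DOMCRF_* definitions.

```c
#include <assert.h>

/* Illustrative flag bits only; not the real XEN_DOMCTL_CDF_* values. */
#define TOY_CDF_HVM_GUEST (1u << 0)
#define TOY_CDF_HAP       (1u << 1)

enum toy_guest_type { TOY_GUEST_PV, TOY_GUEST_PVH, TOY_GUEST_HVM };

/*
 * Models the combined effect of the two hunks: an HVM request is HVM
 * regardless of HAP, a non-HVM request with HAP becomes PVH, and
 * anything else stays plain PV.
 */
static enum toy_guest_type toy_guest_type_from_flags(unsigned int cdf)
{
    if ( cdf & TOY_CDF_HVM_GUEST )
        return TOY_GUEST_HVM;
    if ( cdf & TOY_CDF_HAP )
        return TOY_GUEST_PVH;   /* "PV with HAP is a PVH guest" */
    return TOY_GUEST_PV;
}
```

Because PVH is derived from the HAP flag here, the "PVH guest must have HAP on" check in domain_create() acts as a safety net for other callers rather than something this path can trip.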
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 15/23] PVH xen: mapcache and show registers
PVH doesn't use map cache. show_registers() for PVH takes the HVM path.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain_page.c  |   10 +++++-----
 xen/arch/x86/x86_64/traps.c |    6 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index bc18263..3903952 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -35,7 +35,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
      * then it means we are running on the idle domain's page table and must
      * therefore use its mapcache.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
     {
         /* If we really are idling, perform lazy context switch now. */
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
@@ -72,7 +72,7 @@ void *map_domain_page(unsigned long mfn)
 #endif
 
     v = mapcache_current_vcpu();
-    if ( !v || is_hvm_vcpu(v) )
+    if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn);
 
     dcache = &v->domain->arch.pv_domain.mapcache;
@@ -177,7 +177,7 @@ void unmap_domain_page(const void *ptr)
     ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
     v = mapcache_current_vcpu();
-    ASSERT(v && !is_hvm_vcpu(v));
+    ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv_domain.mapcache;
     ASSERT(dcache->inuse);
@@ -244,7 +244,7 @@ int mapcache_domain_init(struct domain *d)
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
     unsigned int bitmap_pages;
 
-    if ( is_hvm_domain(d) || is_idle_domain(d) )
+    if ( !is_pv_domain(d) || is_idle_domain(d) )
         return 0;
 
 #ifdef NDEBUG
@@ -275,7 +275,7 @@ int mapcache_vcpu_init(struct vcpu *v)
     unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( is_hvm_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) || !dcache->inuse )
         return 0;
 
     if ( ents > dcache->entries )
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index feb50ff..6ac7762 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -85,7 +85,7 @@ void show_registers(struct cpu_user_regs *regs)
     enum context context;
     struct vcpu *v = current;
 
-    if ( is_hvm_vcpu(v) && guest_mode(regs) )
+    if ( !is_pv_vcpu(v) && guest_mode(regs) )
     {
         struct segment_register sreg;
         context = CTXT_hvm_guest;
@@ -146,8 +146,8 @@ void vcpu_show_registers(struct vcpu *v)
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
 
-    /* No need to handle HVM for now. */
-    if ( is_hvm_vcpu(v) )
+    /* No need to handle HVM and PVH for now. */
+    if ( !is_pv_vcpu(v) )
         return;
 
     crs[0] = v->arch.pv_vcpu.ctrlreg[0];
-- 
1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 16/23] PVH xen: mtrr, tsc, timers, grant changes...
PVH only supports limited memory types in Phase I. TSC is also limited
to native mode for the moment. Finally, grant mapping of iomem for PVH
hasn't been explored in Phase I.

Changes in V10:
  - don't migrate timers for PVH as it doesn't use rtc or emulated timers.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/hvm.c   |    4 ++++
 xen/arch/x86/hvm/mtrr.c  |    8 ++++++++
 xen/arch/x86/time.c      |    8 ++++++++
 xen/common/grant_table.c |    4 ++--
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bac4708..93aa42c 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -301,6 +301,10 @@ u64 hvm_get_guest_tsc_adjust(struct vcpu *v)
 
 void hvm_migrate_timers(struct vcpu *v)
 {
+    /* PVH doesn't use rtc and emulated timers, it uses pvclock mechanism. */
+    if ( is_pvh_vcpu(v) )
+        return;
+
     rtc_migrate_timers(v);
     pt_migrate(v);
 }
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index ef51a8d..b9d6411 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -693,6 +693,14 @@ uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
          ((d->vcpu == NULL) || ((v = d->vcpu[0]) == NULL)) )
         return MTRR_TYPE_WRBACK;
 
+    /* PVH fixme: Add support for more memory types. */
+    if ( is_pvh_domain(d) )
+    {
+        if ( direct_mmio )
+            return MTRR_TYPE_UNCACHABLE;
+        return MTRR_TYPE_WRBACK;
+    }
+
     if ( !v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] )
         return MTRR_TYPE_WRBACK;
 
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index f047cb3..4589d43 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1891,6 +1891,14 @@ void tsc_set_info(struct domain *d,
         d->arch.vtsc = 0;
         return;
     }
+    if ( is_pvh_domain(d) && tsc_mode != TSC_MODE_NEVER_EMULATE )
+    {
+        /* PVH fixme: support more tsc modes. */
+        printk(XENLOG_WARNING
+               "PVH currently does not support tsc emulation. Setting timer_mode = native\n");
+        d->arch.vtsc = 0;
+        return;
+    }
 
     switch ( d->arch.tsc_mode = tsc_mode )
     {
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index eb50288..c51da30 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -721,7 +721,7 @@ __gnttab_map_grant_ref(
 
     double_gt_lock(lgt, rgt);
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
@@ -932,7 +932,7 @@ __gnttab_unmap_common(
         act->pin -= GNTPIN_hstw_inc;
     }
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
-- 
1.7.2.3
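The two PVH restrictions in this patch (memory type selection and the TSC mode fallback) are small enough to model directly. The following is a sketch with hypothetical toy_* names; the two memory type encodings are the architectural MTRR values (UC = 0, WB = 6), while the TSC mode enum and the non-PVH branch are deliberately simplified assumptions, not the real tsc_set_info() logic.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MTRR memory type encodings used by the PVH branch. */
enum { TOY_MTRR_TYPE_UNCACHABLE = 0, TOY_MTRR_TYPE_WRBACK = 6 };

/* Hypothetical mode names; only "never emulate" matters for PVH here. */
enum toy_tsc_mode { TOY_TSC_DEFAULT, TOY_TSC_ALWAYS_EMULATE,
                    TOY_TSC_NEVER_EMULATE };

struct toy_domain {
    int is_pvh;
    int vtsc;                    /* 1 if the TSC is emulated */
    enum toy_tsc_mode tsc_mode;
};

/* Models the PVH branch of epte_get_entry_emt(): MMIO is UC, RAM is WB. */
static uint8_t toy_pvh_emt(int direct_mmio)
{
    return direct_mmio ? TOY_MTRR_TYPE_UNCACHABLE : TOY_MTRR_TYPE_WRBACK;
}

/*
 * Models the PVH check added to tsc_set_info(). Returns 1 if the requested
 * mode was recorded, 0 if a PVH domain was forced to native TSC instead.
 * The non-PVH path below is a simplification for illustration only.
 */
static int toy_tsc_set_mode(struct toy_domain *d, enum toy_tsc_mode mode)
{
    if ( d->is_pvh && mode != TOY_TSC_NEVER_EMULATE )
    {
        d->vtsc = 0;             /* early return: tsc_mode is not updated */
        return 0;
    }

    d->tsc_mode = mode;
    d->vtsc = (mode == TOY_TSC_ALWAYS_EMULATE);
    return 1;
}
```

Note that, as in the patch, the early return leaves the stored mode untouched, so a rejected request does not overwrite whatever mode the domain already had.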
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 17/23] PVH xen: Checks, asserts, and limitations for PVH
This patch adds some precautionary checks and debug asserts for PVH. Also, PVH doesn''t support any HVM type guest monitoring at present. Change in V9: - Remove ASSERTs from emulate_gate_op and do_device_not_available. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/hvm.c | 13 +++++++++++++ xen/arch/x86/hvm/mtrr.c | 4 ++++ xen/arch/x86/physdev.c | 13 +++++++++++++ 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 93aa42c..383c5cd 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4530,8 +4530,11 @@ static int hvm_memory_event_traps(long p, uint32_t reason, return 1; } +/* PVH fixme: add support for monitoring guest behaviour in below functions. */ void hvm_memory_event_cr0(unsigned long value, unsigned long old) { + if ( is_pvh_vcpu(current) ) + return; hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_CR0], MEM_EVENT_REASON_CR0, @@ -4540,6 +4543,8 @@ void hvm_memory_event_cr0(unsigned long value, unsigned long old) void hvm_memory_event_cr3(unsigned long value, unsigned long old) { + if ( is_pvh_vcpu(current) ) + return; hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_CR3], MEM_EVENT_REASON_CR3, @@ -4548,6 +4553,8 @@ void hvm_memory_event_cr3(unsigned long value, unsigned long old) void hvm_memory_event_cr4(unsigned long value, unsigned long old) { + if ( is_pvh_vcpu(current) ) + return; hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_CR4], MEM_EVENT_REASON_CR4, @@ -4556,6 +4563,8 @@ void hvm_memory_event_cr4(unsigned long value, unsigned long old) void hvm_memory_event_msr(unsigned long msr, unsigned long value) { + if ( is_pvh_vcpu(current) ) + return; hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_MSR], MEM_EVENT_REASON_MSR, @@ -4568,6 +4577,8 @@ int hvm_memory_event_int3(unsigned long gla) 
unsigned long gfn; gfn = paging_gva_to_gfn(current, gla, &pfec); + if ( is_pvh_vcpu(current) ) + return 0; return hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_INT3], MEM_EVENT_REASON_INT3, @@ -4580,6 +4591,8 @@ int hvm_memory_event_single_step(unsigned long gla) unsigned long gfn; gfn = paging_gva_to_gfn(current, gla, &pfec); + if ( is_pvh_vcpu(current) ) + return 0; return hvm_memory_event_traps(current->domain->arch.hvm_domain .params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP], MEM_EVENT_REASON_SINGLESTEP, diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c index b9d6411..6706af6 100644 --- a/xen/arch/x86/hvm/mtrr.c +++ b/xen/arch/x86/hvm/mtrr.c @@ -578,6 +578,10 @@ int32_t hvm_set_mem_pinned_cacheattr( { struct hvm_mem_pinned_cacheattr_range *range; + /* Side note: A PVH guest writes to MSR_IA32_CR_PAT natively. */ + if ( is_pvh_domain(d) ) + return -EOPNOTSUPP; + if ( !((type == PAT_TYPE_UNCACHABLE) || (type == PAT_TYPE_WRCOMB) || (type == PAT_TYPE_WRTHROUGH) || diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c index 3733c7a..73c8d2a 100644 --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -475,6 +475,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_set_iopl: { struct physdev_set_iopl set_iopl; + + if ( is_pvh_vcpu(current) ) + { + ret = -EPERM; + break; + } + ret = -EFAULT; if ( copy_from_guest(&set_iopl, arg, 1) != 0 ) break; @@ -488,6 +495,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_set_iobitmap: { struct physdev_set_iobitmap set_iobitmap; + + if ( is_pvh_vcpu(current) ) + { + ret = -EPERM; + break; + } ret = -EFAULT; if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 ) break; -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 18/23] PVH xen: add hypercall support for PVH
This patch expands HVM hcall support to include PVH. Changes in v8: - Carve out PVH support of hvm_op to a small function. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/hvm.c | 80 +++++++++++++++++++++++++++++++++++++------ xen/arch/x86/x86_64/traps.c | 2 +- 2 files changed, 70 insertions(+), 12 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 383c5cd..6af020e 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -3192,6 +3192,17 @@ static long hvm_vcpu_op( case VCPUOP_register_vcpu_time_memory_area: rc = do_vcpu_op(cmd, vcpuid, arg); break; + + case VCPUOP_is_up: + case VCPUOP_up: + case VCPUOP_initialise: + /* PVH fixme: this white list should be removed eventually */ + if ( is_pvh_vcpu(current) ) + rc = do_vcpu_op(cmd, vcpuid, arg); + else + rc = -ENOSYS; + break; + default: rc = -ENOSYS; break; @@ -3312,6 +3323,24 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = { HYPERCALL(tmem_op) }; +/* PVH 32bitfixme. 
*/ +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = { + HYPERCALL(platform_op), + HYPERCALL(memory_op), + HYPERCALL(xen_version), + HYPERCALL(console_io), + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op, + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op, + HYPERCALL(mmuext_op), + HYPERCALL(xsm_op), + HYPERCALL(sched_op), + HYPERCALL(event_channel_op), + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op, + HYPERCALL(hvm_op), + HYPERCALL(sysctl), + HYPERCALL(domctl) +}; + int hvm_do_hypercall(struct cpu_user_regs *regs) { struct vcpu *curr = current; @@ -3338,7 +3367,9 @@ int hvm_do_hypercall(struct cpu_user_regs *regs) if ( (eax & 0x80000000) && is_viridian_domain(curr->domain) ) return viridian_hypercall(regs); - if ( (eax >= NR_hypercalls) || !hvm_hypercall32_table[eax] ) + if ( (eax >= NR_hypercalls) || + (is_pvh_vcpu(curr) && !pvh_hypercall64_table[eax]) || + (is_hvm_vcpu(curr) && !hvm_hypercall32_table[eax]) ) { regs->eax = -ENOSYS; return HVM_HCALL_completed; @@ -3353,16 +3384,20 @@ int hvm_do_hypercall(struct cpu_user_regs *regs) regs->r10, regs->r8, regs->r9); curr->arch.hvm_vcpu.hcall_64bit = 1; - regs->rax = hvm_hypercall64_table[eax](regs->rdi, - regs->rsi, - regs->rdx, - regs->r10, - regs->r8, - regs->r9); + if ( is_pvh_vcpu(curr) ) + regs->rax = pvh_hypercall64_table[eax](regs->rdi, regs->rsi, + regs->rdx, regs->r10, + regs->r8, regs->r9); + else + regs->rax = hvm_hypercall64_table[eax](regs->rdi, regs->rsi, + regs->rdx, regs->r10, + regs->r8, regs->r9); curr->arch.hvm_vcpu.hcall_64bit = 0; } else { + ASSERT(!is_pvh_vcpu(curr)); /* PVH 32bitfixme. 
*/ + HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%u(%x, %x, %x, %x, %x, %x)", eax, (uint32_t)regs->ebx, (uint32_t)regs->ecx, (uint32_t)regs->edx, (uint32_t)regs->esi, @@ -3760,6 +3795,23 @@ static int hvm_replace_event_channel(struct vcpu *v, domid_t remote_domid, return 0; } +static long pvh_hvm_op(unsigned long op, struct domain *d, + struct xen_hvm_param *harg) +{ + long rc = -ENOSYS; + + if ( op == HVMOP_set_param ) + { + if ( harg->index == HVM_PARAM_CALLBACK_IRQ ) + { + hvm_set_callback_via(d, harg->value); + hvm_latch_shinfo_size(d); + rc = 0; + } + } + return rc; +} + long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) { @@ -3787,12 +3839,18 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) return -ESRCH; rc = -EINVAL; - if ( !is_hvm_domain(d) ) - goto param_fail; + if ( is_pv_domain(d) ) + goto param_done; rc = xsm_hvm_param(XSM_TARGET, d, op); if ( rc ) - goto param_fail; + goto param_done; + + if ( is_pvh_domain(d) ) + { + rc = pvh_hvm_op(op, d, &a); + goto param_done; + } if ( op == HVMOP_set_param ) { @@ -4001,7 +4059,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) op == HVMOP_set_param ? "set" : "get", a.index, a.value); - param_fail: + param_done: rcu_unlock_domain(d); break; } diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index 6ac7762..af727f9 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -623,7 +623,7 @@ static void hypercall_page_initialise_ring3_kernel(void *hypercall_page) void hypercall_page_initialise(struct domain *d, void *hypercall_page) { memset(hypercall_page, 0xCC, PAGE_SIZE); - if ( is_hvm_domain(d) ) + if ( !is_pv_domain(d) ) hvm_hypercall_page_initialise(d, hypercall_page); else if ( !is_pv_32bit_domain(d) ) hypercall_page_initialise_ring3_kernel(hypercall_page); -- 1.7.2.3
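The pvh_hypercall64_table above is a sparse white list: entries not named in the initializer stay NULL, and hvm_do_hypercall() turns a NULL slot into -ENOSYS. A minimal standalone sketch of that technique follows; the toy_* names, the table size, and the hypercall numbers are made up for illustration and do not match the real __HYPERVISOR_* values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NR_HYPERCALLS 8

/* Made-up hypercall numbers, for illustration only. */
enum { TOY_HC_MEMORY_OP = 1, TOY_HC_XEN_VERSION = 3,
       TOY_HC_SET_TRAP_TABLE = 5 };

typedef long toy_hypercall_t(unsigned long a1, unsigned long a2);

static long toy_memory_op(unsigned long a1, unsigned long a2)
{
    return (long)(a1 + a2);      /* placeholder behaviour */
}

static long toy_xen_version(unsigned long a1, unsigned long a2)
{
    (void)a1;
    (void)a2;
    return 4;                    /* placeholder behaviour */
}

/*
 * Designated initializers fill only the white-listed slots; every other
 * slot is implicitly NULL, e.g. set_trap_table, which PVH does not need.
 */
static toy_hypercall_t *const toy_pvh_table[TOY_NR_HYPERCALLS] = {
    [TOY_HC_MEMORY_OP]   = toy_memory_op,
    [TOY_HC_XEN_VERSION] = toy_xen_version,
};

static long toy_do_hypercall(unsigned long nr,
                             unsigned long a1, unsigned long a2)
{
    /* The range check must come first so the table is never over-indexed. */
    if ( nr >= TOY_NR_HYPERCALLS || toy_pvh_table[nr] == NULL )
        return -ENOSYS;

    return toy_pvh_table[nr](a1, a2);
}
```

The same shape explains why adding a hypercall for PVH later is a one-line change to the table rather than a change to the dispatcher.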
This patch contains vmcs changes related for PVH, mainly creating a VMCS for PVH guest. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/vmx/vmcs.c | 247 ++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 245 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index 36f167f..8d35370 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -634,7 +634,7 @@ void vmx_vmcs_exit(struct vcpu *v) { /* Don''t confuse vmx_do_resume (for @v or @current!) */ vmx_clear_vmcs(v); - if ( is_hvm_vcpu(current) ) + if ( !is_pv_vcpu(current) ) vmx_load_vmcs(current); spin_unlock(&v->arch.hvm_vmx.vmcs_lock); @@ -856,6 +856,239 @@ static void vmx_set_common_host_vmcs_fields(struct vcpu *v) __vmwrite(HOST_SYSENTER_EIP, sysenter_eip); } +static int pvh_check_requirements(struct vcpu *v) +{ + u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); + + if ( !paging_mode_hap(v->domain) ) + { + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); + return -EINVAL; + } + if ( !cpu_has_vmx_pat ) + { + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_msr_bitmap ) + { + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_vpid ) + { + printk(XENLOG_G_INFO "PVH: CPU doesn''t have VPID support\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_secondary_exec_control ) + { + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); + return -ENOSYS; + } + + if ( v->domain->arch.vtsc ) + { + printk(XENLOG_G_INFO + "At present PVH only supports the default timer mode\n"); + return -ENOSYS; + } + + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; + if ( (tmpval & required) != required ) + { + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", + required); + return -ENOSYS; + } + + return 0; +} + +static int pvh_construct_vmcs(struct vcpu *v) 
+{ + int rc, msr_type; + unsigned long *msr_bitmap; + struct domain *d = v->domain; + struct p2m_domain *p2m = p2m_get_hostp2m(d); + struct ept_data *ept = &p2m->ept; + u32 vmexit_ctl = vmx_vmexit_control; + u32 vmentry_ctl = vmx_vmentry_control; + u64 host_pat, tmpval = -1; + + if ( (rc = pvh_check_requirements(v)) ) + return rc; + + msr_bitmap = alloc_xenheap_page(); + if ( msr_bitmap == NULL ) + return -ENOMEM; + + /* 1. Pin-Based Controls: */ + __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control); + + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control; + + /* 2. Primary Processor-based controls: */ + /* + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, + * then pv_vcpu.ctrlreg must be added to the pvh struct. + */ + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING; + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; + + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING | + CPU_BASED_CR3_LOAD_EXITING | + CPU_BASED_CR3_STORE_EXITING); + v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; + v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_MSR_BITMAP; + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW; + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING; + + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control); + + /* 3. Secondary Processor-based controls (Intel SDM: resvd bits are 0): */ + v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT; + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID; + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_PAUSE_LOOP_EXITING; + + __vmwrite(SECONDARY_VM_EXEC_CONTROL, + v->arch.hvm_vmx.secondary_exec_control); + + __vmwrite(IO_BITMAP_A, virt_to_maddr((char *)hvm_io_bitmap + 0)); + __vmwrite(IO_BITMAP_B, virt_to_maddr((char *)hvm_io_bitmap + PAGE_SIZE)); + + /* MSR bitmap for intercepts. 
*/ + memset(msr_bitmap, ~0, PAGE_SIZE); + v->arch.hvm_vmx.msr_bitmap = msr_bitmap; + __vmwrite(MSR_BITMAP, virt_to_maddr(msr_bitmap)); + + msr_type = MSR_TYPE_R | MSR_TYPE_W; + /* Disable interecepts for MSRs that have corresponding VMCS fields. */ + vmx_disable_intercept_for_msr(v, MSR_FS_BASE, msr_type); + vmx_disable_intercept_for_msr(v, MSR_GS_BASE, msr_type); + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, msr_type); + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, msr_type); + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, msr_type); + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, msr_type); + vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, msr_type); + + /* + * We don''t disable intercepts for MSRs: MSR_STAR, MSR_LSTAR, MSR_CSTAR, + * and MSR_SYSCALL_MASK because we need to specify save/restore area to + * save/restore at every VM exit and entry. Instead, let the intercept + * functions save them into vmx_msr_state fields. See comment in + * vmx_restore_host_msrs(). See also vmx_restore_guest_msrs(). + */ + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0); + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); + + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); + + /* + * Note: we run with default VM_ENTRY_LOAD_DEBUG_CTLS of 1, which means + * upon vmentry, the cpu reads/loads VMCS.DR7 and VMCS.DEBUGCTLS, and not + * use the host values. 0 would cause it to not use the VMCS values. + */ + vmentry_ctl &= ~VM_ENTRY_LOAD_GUEST_EFER; + vmentry_ctl &= ~VM_ENTRY_SMM; + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR; + /* PVH 32bitfixme. */ + vmentry_ctl |= VM_ENTRY_IA32E_MODE; /* GUEST_EFER.LME/LMA ignored */ + + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); + + vmx_set_common_host_vmcs_fields(v); + + __vmwrite(VM_ENTRY_INTR_INFO, 0); + __vmwrite(CR3_TARGET_COUNT, 0); + __vmwrite(GUEST_ACTIVITY_STATE, 0); + + /* These are sorta irrelevant as we load the discriptors directly. 
*/ + __vmwrite(GUEST_CS_SELECTOR, 0); + __vmwrite(GUEST_DS_SELECTOR, 0); + __vmwrite(GUEST_SS_SELECTOR, 0); + __vmwrite(GUEST_ES_SELECTOR, 0); + __vmwrite(GUEST_FS_SELECTOR, 0); + __vmwrite(GUEST_GS_SELECTOR, 0); + + __vmwrite(GUEST_CS_BASE, 0); + __vmwrite(GUEST_CS_LIMIT, ~0u); + /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. */ + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); + + __vmwrite(GUEST_DS_BASE, 0); + __vmwrite(GUEST_DS_LIMIT, ~0u); + __vmwrite(GUEST_DS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_SS_BASE, 0); + __vmwrite(GUEST_SS_LIMIT, ~0u); + __vmwrite(GUEST_SS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_ES_BASE, 0); + __vmwrite(GUEST_ES_LIMIT, ~0u); + __vmwrite(GUEST_ES_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_FS_BASE, 0); + __vmwrite(GUEST_FS_LIMIT, ~0u); + __vmwrite(GUEST_FS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_GS_BASE, 0); + __vmwrite(GUEST_GS_LIMIT, ~0u); + __vmwrite(GUEST_GS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_GDTR_BASE, 0); + __vmwrite(GUEST_GDTR_LIMIT, 0); + + __vmwrite(GUEST_LDTR_BASE, 0); + __vmwrite(GUEST_LDTR_LIMIT, 0); + __vmwrite(GUEST_LDTR_AR_BYTES, 0x82); /* LDT */ + __vmwrite(GUEST_LDTR_SELECTOR, 0); + + /* Guest TSS. */ + __vmwrite(GUEST_TR_BASE, 0); + __vmwrite(GUEST_TR_LIMIT, 0xff); + __vmwrite(GUEST_TR_AR_BYTES, 0x8b); /* 32-bit TSS (busy) */ + + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); + __vmwrite(GUEST_DR7, 0); + __vmwrite(VMCS_LINK_POINTER, ~0UL); + + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0); + + v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK | (1U << TRAP_debug) | + (1U << TRAP_int3) | (1U << TRAP_no_device); + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap); + + /* Set WP bit so rdonly pages are not written from CPL 0. 
*/ + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP; + __vmwrite(GUEST_CR0, tmpval); + __vmwrite(CR0_READ_SHADOW, tmpval); + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = tmpval; + + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); + __vmwrite(GUEST_CR4, tmpval); + __vmwrite(CR4_READ_SHADOW, tmpval); + v->arch.hvm_vcpu.guest_cr[4] = tmpval; + + __vmwrite(CR0_GUEST_HOST_MASK, ~0UL); + __vmwrite(CR4_GUEST_HOST_MASK, ~0UL); + + v->arch.hvm_vmx.vmx_realmode = 0; + + ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m)); + __vmwrite(EPT_POINTER, ept_get_eptp(ept)); + + rdmsrl(MSR_IA32_CR_PAT, host_pat); + __vmwrite(HOST_PAT, host_pat); + __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET); + + /* The paging mode is updated for PVH by arch_set_info_guest(). */ + + return 0; +} + static int construct_vmcs(struct vcpu *v) { struct domain *d = v->domain; @@ -864,6 +1097,13 @@ static int construct_vmcs(struct vcpu *v) vmx_vmcs_enter(v); + if ( is_pvh_vcpu(v) ) + { + int rc = pvh_construct_vmcs(v); + vmx_vmcs_exit(v); + return rc; + } + /* VMCS controls. */ __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control); @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v) hvm_asid_flush_vcpu(v); } + if ( is_pvh_vcpu(v) ) + reset_stack_and_jump(vmx_asm_do_vmentry); + debug_state = v->domain->debugger_attached || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_INT3] || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP]; @@ -1477,7 +1720,7 @@ static void vmcs_dump(unsigned char ch) for_each_domain ( d ) { - if ( !is_hvm_domain(d) ) + if ( is_pv_domain(d) ) continue; printk("\n>>> Domain %d <<<\n", d->domain_id); for_each_vcpu ( d, v ) -- 1.7.2.3
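For checking the magic numbers written to the GUEST_*_AR_BYTES fields in pvh_construct_vmcs(), a small decoder helps. This is a sketch with hypothetical toy_* names; the bit positions follow the VMX segment access-rights layout (type in bits 3:0, S bit 4, DPL bits 6:5, P bit 7, AVL bit 12, L bit 13, D/B bit 14, G bit 15). The second helper models the "all required bits present" test that pvh_check_requirements() applies to CR4.

```c
#include <assert.h>
#include <stdint.h>

struct toy_seg_ar {
    unsigned int type;  /* segment type, bits 3:0 */
    unsigned int s;     /* 1 = code/data, 0 = system */
    unsigned int dpl;   /* descriptor privilege level */
    unsigned int p;     /* present */
    unsigned int avl;   /* available for software use */
    unsigned int l;     /* 64-bit code segment */
    unsigned int db;    /* default operation size */
    unsigned int g;     /* granularity */
};

/* Split a VMCS access-rights value into its fields. */
static struct toy_seg_ar toy_decode_ar(uint32_t ar)
{
    struct toy_seg_ar r;

    r.type = ar & 0xf;
    r.s    = (ar >> 4) & 1;
    r.dpl  = (ar >> 5) & 3;
    r.p    = (ar >> 7) & 1;
    r.avl  = (ar >> 12) & 1;
    r.l    = (ar >> 13) & 1;
    r.db   = (ar >> 14) & 1;
    r.g    = (ar >> 15) & 1;
    return r;
}

/* Models the check: (tmpval & required) != required means "missing". */
static int toy_has_required(uint64_t available, uint64_t required)
{
    return (available & required) == required;
}
```

Decoding 0xa09b gives a present, DPL 0 code segment with L set and D/B clear, i.e. 64-bit code, while 0xc093 gives a present read/write data segment with D/B and G set, which matches the comments in the patch.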
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 20/23] PVH xen: HVM support of PVH guest creation/destruction
This patch implements the HVM portion of the guest create, ie vcpu and domain initilization. Some changes to support the destroy path. Changes in V10: - Move hvm_vcpu.guest_efer setting to here from VMX. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/arch/x86/hvm/hvm.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 65 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 6af020e..c742d7b 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -514,6 +514,27 @@ static int hvm_print_line( return X86EMUL_OKAY; } +static int pvh_dom_initialise(struct domain *d) +{ + int rc; + + if ( !d->arch.hvm_domain.hap_enabled ) + return -EINVAL; + + spin_lock_init(&d->arch.hvm_domain.irq_lock); + + hvm_init_cacheattr_region_list(d); + + if ( (rc = paging_enable(d, PG_refcounts|PG_translate|PG_external)) != 0 ) + goto pvh_dominit_fail; + + return 0; + + pvh_dominit_fail: + hvm_destroy_cacheattr_region_list(d); + return rc; +} + int hvm_domain_initialise(struct domain *d) { int rc; @@ -524,6 +545,8 @@ int hvm_domain_initialise(struct domain *d) "on a non-VT/AMDV platform.\n"); return -EINVAL; } + if ( is_pvh_domain(d) ) + return pvh_dom_initialise(d); spin_lock_init(&d->arch.hvm_domain.pbuf_lock); spin_lock_init(&d->arch.hvm_domain.irq_lock); @@ -588,6 +611,9 @@ int hvm_domain_initialise(struct domain *d) void hvm_domain_relinquish_resources(struct domain *d) { + if ( is_pvh_domain(d) ) + return; + if ( hvm_funcs.nhvm_domain_relinquish_resources ) hvm_funcs.nhvm_domain_relinquish_resources(d); @@ -612,11 +638,15 @@ void hvm_domain_relinquish_resources(struct domain *d) void hvm_domain_destroy(struct domain *d) { + hvm_destroy_cacheattr_region_list(d); + + if ( is_pvh_domain(d) ) + return; + hvm_funcs.domain_destroy(d); rtc_deinit(d); stdvga_deinit(d); vioapic_deinit(d); - hvm_destroy_cacheattr_region_list(d); } static int 
hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h) @@ -1070,6 +1100,33 @@ static int __init __hvm_register_CPU_XSAVE_save_and_restore(void) } __initcall(__hvm_register_CPU_XSAVE_save_and_restore); +static int pvh_vcpu_initialise(struct vcpu *v) +{ + int rc; + + if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) + return rc; + + softirq_tasklet_init(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet, + (void(*)(unsigned long))hvm_assert_evtchn_irq, + (unsigned long)v); + + v->arch.hvm_vcpu.hcall_64bit = 1; /* PVH 32bitfixme. */ + v->arch.user_regs.eflags = 2; + v->arch.hvm_vcpu.inject_trap.vector = -1; + + if ( (rc = hvm_vcpu_cacheattr_init(v)) != 0 ) + { + hvm_funcs.vcpu_destroy(v); + return rc; + } + + /* This for hvm_long_mode_enabled(v). */ + v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME; + + return 0; +} + int hvm_vcpu_initialise(struct vcpu *v) { int rc; @@ -1081,6 +1138,9 @@ int hvm_vcpu_initialise(struct vcpu *v) spin_lock_init(&v->arch.hvm_vcpu.tm_lock); INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list); + if ( is_pvh_vcpu(v) ) + return pvh_vcpu_initialise(v); + if ( (rc = vlapic_init(v)) != 0 ) goto fail1; @@ -1169,7 +1229,10 @@ void hvm_vcpu_destroy(struct vcpu *v) tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet); hvm_vcpu_cacheattr_destroy(v); - vlapic_destroy(v); + + if ( !is_pvh_vcpu(v) ) + vlapic_destroy(v); + hvm_funcs.vcpu_destroy(v); /* Event channel is already freed by evtchn_destroy(). */ -- 1.7.2.3
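The patch above sets guest_efer to EFER_SCE | EFER_LMA | EFER_LME so that hvm_long_mode_enabled(v) holds for a PVH vcpu. As a rough, standalone illustration of what those bits mean, here is a minimal sketch. The bit positions are the architectural EFER layout; struct sketch_vcpu and both sketch_* functions are hypothetical stand-ins, not Xen code, and the long-mode test is only an assumed shape of the real macro.

```c
#include <stdint.h>

/* EFER bit positions as architecturally defined for x86-64. */
#define EFER_SCE (1ULL << 0)    /* SYSCALL/SYSRET enable */
#define EFER_LME (1ULL << 8)    /* long mode enable */
#define EFER_LMA (1ULL << 10)   /* long mode active */

/* Hypothetical stand-in for the vcpu field the patch touches. */
struct sketch_vcpu {
    uint64_t guest_efer;
};

/* Mirrors the assignment made at the end of pvh_vcpu_initialise(). */
static void sketch_pvh_set_efer(struct sketch_vcpu *v)
{
    v->guest_efer = EFER_SCE | EFER_LMA | EFER_LME;
}

/* Assumed shape of a long-mode test: LMA set means 64-bit mode is active. */
static int sketch_long_mode_enabled(const struct sketch_vcpu *v)
{
    return (v->guest_efer & EFER_LMA) != 0;
}
```

A PVH guest in this phase is always 64-bit, which is presumably why the value can be fixed at vcpu creation instead of being tracked through EFER writes.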
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 21/23] PVH xen: VMX support of PVH guest creation/destruction
This patch implements the vmx portion of the guest create, i.e. vcpu and domain initialization. Some changes to support the destroy path. Change in V10: - Don't call vmx_domain_initialise / vmx_domain_destroy for PVH. - Do not set hvm_vcpu.guest_efer here in vmx.c. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/vmx/vmx.c | 28 ++++++++++++++++++++++++++++ 1 files changed, 28 insertions(+), 0 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 80109c1..f6ea39a 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1076,6 +1076,28 @@ static void vmx_update_host_cr3(struct vcpu *v) vmx_vmcs_exit(v); } +/* + * PVH guest never causes CR3 write vmexit. This is called during the guest + * setup. + */ +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr) +{ + vmx_vmcs_enter(v); + switch ( cr ) + { + case 3: + __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]); + hvm_asid_flush_vcpu(v); + break; + + default: + printk(XENLOG_ERR + "PVH: d%d v%d unexpected cr%d update at rip:%lx\n", + v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP)); + } + vmx_vmcs_exit(v); +} + void vmx_update_debug_state(struct vcpu *v) { unsigned long mask; @@ -1095,6 +1117,12 @@ void vmx_update_debug_state(struct vcpu *v) static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr) { + if ( is_pvh_vcpu(v) ) + { + vmx_update_pvh_cr(v, cr); + return; + } + vmx_vmcs_enter(v); switch ( cr ) -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 22/23] PVH xen: preparatory patch for the pvh vmexit handler patch
This is a preparatory patch for the next pvh vmexit handler patch. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- xen/arch/x86/hvm/vmx/pvh.c | 5 +++++ xen/arch/x86/hvm/vmx/vmx.c | 6 ++++++ xen/arch/x86/traps.c | 4 ++-- xen/include/asm-x86/hvm/vmx/vmx.h | 1 + xen/include/asm-x86/processor.h | 2 ++ xen/include/asm-x86/traps.h | 2 ++ 6 files changed, 18 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/pvh.c b/xen/arch/x86/hvm/vmx/pvh.c index b37e423..8e61d23 100644 --- a/xen/arch/x86/hvm/vmx/pvh.c +++ b/xen/arch/x86/hvm/vmx/pvh.c @@ -20,6 +20,11 @@ #include <asm/hvm/nestedhvm.h> #include <asm/xstate.h> +/* Implemented in the next patch */ +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs) +{ +} + /* * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called * from arch_set_info_guest() which sets the (PVH relevant) non-vmcs fields. diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index f6ea39a..bbfa130 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -2490,6 +2490,12 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) ) return vmx_failed_vmentry(exit_reason, regs); + if ( is_pvh_vcpu(v) ) + { + vmx_pvh_vmexit_handler(regs); + return; + } + if ( v->arch.hvm_vmx.vmx_realmode ) { /* Put RFLAGS back the way the guest wants it */ diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index fe8b94c..9c82b45 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -745,7 +745,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx, return 1; } -static void pv_cpuid(struct cpu_user_regs *regs) +void pv_cpuid(struct cpu_user_regs *regs) { uint32_t a, b, c, d; @@ -1904,7 +1904,7 @@ static int is_cpufreq_controller(struct domain *d) #include "x86_64/mmconfig.h" -static int emulate_privileged_op(struct cpu_user_regs *regs) +int 
emulate_privileged_op(struct cpu_user_regs *regs) { enum x86_segment which_sel; struct vcpu *v = current; diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h index 9e6c481..44e4136 100644 --- a/xen/include/asm-x86/hvm/vmx/vmx.h +++ b/xen/include/asm-x86/hvm/vmx/vmx.h @@ -474,6 +474,7 @@ void vmx_dr_access(unsigned long exit_qualification, struct cpu_user_regs *regs); void vmx_fpu_enter(struct vcpu *v); int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp); +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs); int alloc_p2m_hap_data(struct p2m_domain *p2m); void free_p2m_hap_data(struct p2m_domain *p2m); diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h index 5cdacc7..22a9653 100644 --- a/xen/include/asm-x86/processor.h +++ b/xen/include/asm-x86/processor.h @@ -566,6 +566,8 @@ void microcode_set_module(unsigned int); int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len); int microcode_resume_cpu(int cpu); +void pv_cpuid(struct cpu_user_regs *regs); + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_X86_PROCESSOR_H */ diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h index 1d9b087..8c3540a 100644 --- a/xen/include/asm-x86/traps.h +++ b/xen/include/asm-x86/traps.h @@ -50,4 +50,6 @@ extern int send_guest_trap(struct domain *d, uint16_t vcpuid, unsigned int trap_nr); int emulate_forced_invalid_op(struct cpu_user_regs *regs); +int emulate_privileged_op(struct cpu_user_regs *regs); + #endif /* ASM_TRAP_H */ -- 1.7.2.3
Mukesh Rathor
2013-Jul-24 01:59 UTC
[V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
This patch contains the vmx exit handler for PVH guests. Note it contains a macro dbgp1 to print vmexit reasons and a lot of other data to go with it. It can be enabled by setting pvhdbg to 1. This can be very useful for debugging during the first few months of testing, after which it can be removed at the maintainer's discretion. Changes in V2: - Move non VMX generic code to arch/x86/hvm/pvh.c - Remove get_gpr_ptr() and use existing decode_register() instead. - Defer call to pvh vmx exit handler until interrupts are enabled. So the caller vmx_pvh_vmexit_handler() handles the NMI/EXT-INT/TRIPLE_FAULT now. - Fix the CPUID (wrongly) clearing bit 24. No need to do this now, set the correct feature bits in CR4 during vmcs creation. - Fix few hard tabs. Changes in V3: - Lot of cleanup and rework in PVH vm exit handler. - add parameter to emulate_forced_invalid_op(). Changes in V5: - Move pvh.c and emulate_forced_invalid_op related changes to another patch. - Formatting. - Remove vmx_pvh_read_descriptor(). - Use SS DPL instead of CS.RPL for CPL. - Remove pvh_user_cpuid() and call pv_cpuid for user mode also. Changes in V6: - Replace domain_crash_synchronous() with domain_crash(). Changes in V7: - Don't read all selectors on every vmexit. Do that only for the IO instruction vmexit. - Add couple checks and set guest_cr[4] in access_cr4(). - Add period after all comments in case that's an issue. - Move making pv_cpuid and emulate_privileged_op public here. Changes in V8: - Mainly, don't read selectors on vmexit. The macros now come to VMCS to read selectors on demand. 
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/vmx/pvh.c | 451 +++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 450 insertions(+), 1 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/pvh.c b/xen/arch/x86/hvm/vmx/pvh.c index 8e61d23..ba11967 100644 --- a/xen/arch/x86/hvm/vmx/pvh.c +++ b/xen/arch/x86/hvm/vmx/pvh.c @@ -20,9 +20,458 @@ #include <asm/hvm/nestedhvm.h> #include <asm/xstate.h> -/* Implemented in the next patch */ +#ifndef NDEBUG +static int pvhdbg = 0; +#define dbgp1(...) do { (pvhdbg == 1) ? printk(__VA_ARGS__) : 0; } while ( 0 ) +#else +#define dbgp1(...) ((void)0) +#endif + +/* Returns : 0 == msr read successfully. */ +static int vmxit_msr_read(struct cpu_user_regs *regs) +{ + u64 msr_content = 0; + + switch ( regs->ecx ) + { + case MSR_IA32_MISC_ENABLE: + rdmsrl(MSR_IA32_MISC_ENABLE, msr_content); + msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL | + MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; + break; + + default: + /* PVH fixme: see hvm_msr_read_intercept(). */ + rdmsrl(regs->ecx, msr_content); + break; + } + regs->eax = (uint32_t)msr_content; + regs->edx = (uint32_t)(msr_content >> 32); + vmx_update_guest_eip(); + + dbgp1("msr read c:%lx a:%lx d:%lx RIP:%lx RSP:%lx\n", regs->ecx, regs->eax, + regs->edx, regs->rip, regs->rsp); + + return 0; +} + +/* Returns : 0 == msr written successfully. */ +static int vmxit_msr_write(struct cpu_user_regs *regs) +{ + uint64_t msr_content = (uint32_t)regs->eax | ((uint64_t)regs->edx << 32); + + dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx, + regs->eax, regs->edx); + + if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY ) + { + vmx_update_guest_eip(); + return 0; + } + return 1; +} + +static int vmxit_debug(struct cpu_user_regs *regs) +{ + struct vcpu *vp = current; + unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION); + + write_debugreg(6, exit_qualification | 0xffff0ff0); + + /* gdbsx or another debugger. Never pause dom0. 
*/ + if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached ) + domain_pause_for_debugger(); + else + hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE); + + return 0; +} + +/* Returns: rc == 0: handled the MTF vmexit. */ +static int vmxit_mtf(struct cpu_user_regs *regs) +{ + struct vcpu *vp = current; + int rc = -EINVAL, ss = vp->arch.hvm_vcpu.single_step; + + vp->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, vp->arch.hvm_vmx.exec_control); + vp->arch.hvm_vcpu.single_step = 0; + + if ( vp->domain->debugger_attached && ss ) + { + domain_pause_for_debugger(); + rc = 0; + } + return rc; +} + +static int vmxit_int3(struct cpu_user_regs *regs) +{ + int ilen = vmx_get_instruction_length(); + struct vcpu *vp = current; + struct hvm_trap trap_info = { + .vector = TRAP_int3, + .type = X86_EVENTTYPE_SW_EXCEPTION, + .error_code = HVM_DELIVER_NO_ERROR_CODE, + .insn_len = ilen + }; + + /* gdbsx or another debugger. Never pause dom0. */ + if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached ) + { + regs->eip += ilen; + dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id()); + current->arch.gdbsx_vcpu_event = TRAP_int3; + domain_pause_for_debugger(); + return 0; + } + hvm_inject_trap(&trap_info); + + return 0; +} + +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */ +static int vmxit_invalid_op(struct cpu_user_regs *regs) +{ + if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) ) + hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); + + return 0; +} + +/* Returns: rc == 0: handled the exception. 
*/ +static int vmxit_exception(struct cpu_user_regs *regs) +{ + int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK; + int rc = -ENOSYS; + + dbgp1(" EXCPT: vec:%d cs:%lx r.IP:%lx\n", vector, + __vmread(GUEST_CS_SELECTOR), regs->eip); + + switch ( vector ) + { + case TRAP_debug: + rc = vmxit_debug(regs); + break; + + case TRAP_int3: + rc = vmxit_int3(regs); + break; + + case TRAP_invalid_op: + rc = vmxit_invalid_op(regs); + break; + + case TRAP_no_device: + hvm_funcs.fpu_dirty_intercept(); + rc = 0; + break; + + default: + printk(XENLOG_G_WARNING + "PVH: Unhandled trap:%d. IP:%lx\n", vector, regs->eip); + } + return rc; +} + +static int vmxit_vmcall(struct cpu_user_regs *regs) +{ + if ( hvm_do_hypercall(regs) != HVM_HCALL_preempted ) + vmx_update_guest_eip(); + return 0; +} + +/* Returns: rc == 0: success. */ +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp) +{ + struct vcpu *vp = current; + + if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR ) + { + unsigned long new_cr0 = *regp; + unsigned long old_cr0 = __vmread(GUEST_CR0); + + dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp); + if ( (u32)new_cr0 != new_cr0 ) + { + printk(XENLOG_G_WARNING + "Guest setting upper 32 bits in CR0: %lx", new_cr0); + return -EPERM; + } + + new_cr0 &= ~HVM_CR0_GUEST_RESERVED_BITS; + /* ET is reserved and should always be 1. */ + new_cr0 |= X86_CR0_ET; + + /* A pvh is not expected to change to real mode. */ + if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) != + (X86_CR0_PG | X86_CR0_PE) ) + { + printk(XENLOG_G_WARNING + "PVH attempting to turn off PE/PG. 
CR0:%lx\n", new_cr0); + return -EPERM; + } + /* TS going from 1 to 0 */ + if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) ) + vmx_fpu_enter(vp); + + vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0; + __vmwrite(GUEST_CR0, new_cr0); + __vmwrite(CR0_READ_SHADOW, new_cr0); + } + else + *regp = __vmread(GUEST_CR0); + + return 0; +} + +/* Returns: rc == 0: success. */ +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp) +{ + if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR ) + { + struct vcpu *vp = current; + u64 old_val = __vmread(GUEST_CR4); + u64 new = *regp; + + if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) ) + { + printk(XENLOG_G_WARNING + "PVH guest attempts to set reserved bit in CR4: %lx", new); + hvm_inject_hw_exception(TRAP_gp_fault, 0); + return 0; + } + + if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) ) + { + printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while " + "EFER.LMA is set"); + hvm_inject_hw_exception(TRAP_gp_fault, 0); + return 0; + } + + vp->arch.hvm_vcpu.guest_cr[4] = new; + + if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) ) + vpid_sync_all(); + + __vmwrite(CR4_READ_SHADOW, new); + + new &= ~X86_CR4_PAE; /* PVH always runs with hap enabled. */ + new |= X86_CR4_VMXE | X86_CR4_MCE; + __vmwrite(GUEST_CR4, new); + } + else + *regp = __vmread(CR4_READ_SHADOW); + + return 0; +} + +/* Returns: rc == 0: success, else -errno. 
*/ +static int vmxit_cr_access(struct cpu_user_regs *regs) +{ + unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION); + uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification); + int cr, rc = -EINVAL; + + switch ( acc_typ ) + { + case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR: + case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR: + { + uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification); + uint64_t *regp = decode_register(gpr, regs, 0); + cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification); + + if ( regp == NULL ) + break; + + switch ( cr ) + { + case 0: + rc = access_cr0(regs, acc_typ, regp); + break; + + case 3: + printk(XENLOG_G_ERR "PVH: unexpected cr3 vmexit. rip:%lx\n", + regs->rip); + domain_crash(current->domain); + break; + + case 4: + rc = access_cr4(regs, acc_typ, regp); + break; + } + if ( rc == 0 ) + vmx_update_guest_eip(); + break; + } + + case VMX_CONTROL_REG_ACCESS_TYPE_CLTS: + { + struct vcpu *vp = current; + unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS; + vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0; + + vmx_fpu_enter(vp); + __vmwrite(GUEST_CR0, cr0); + __vmwrite(CR0_READ_SHADOW, cr0); + vmx_update_guest_eip(); + rc = 0; + } + } + return rc; +} + +/* + * Note: A PVH guest sets IOPL natively by setting bits in the eflags, and not + * via hypercalls used by a PV. + */ +static int vmxit_io_instr(struct cpu_user_regs *regs) +{ + struct segment_register seg; + int requested = (regs->rflags & X86_EFLAGS_IOPL) >> 12; + int curr_lvl = (regs->rflags & X86_EFLAGS_VM) ? 
3 : 0; + + if ( curr_lvl == 0 ) + { + hvm_get_segment_register(current, x86_seg_ss, &seg); + curr_lvl = seg.attr.fields.dpl; + } + if ( requested >= curr_lvl && emulate_privileged_op(regs) ) + return 0; + + hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code); + return 0; +} + +static int pvh_ept_handle_violation(unsigned long qualification, + paddr_t gpa, struct cpu_user_regs *regs) +{ + unsigned long gla, gfn = gpa >> PAGE_SHIFT; + p2m_type_t p2mt; + mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt); + + printk(XENLOG_G_ERR "EPT violation %#lx (%c%c%c/%c%c%c), " + "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n", + qualification, + (qualification & EPT_READ_VIOLATION) ? 'r' : '-', + (qualification & EPT_WRITE_VIOLATION) ? 'w' : '-', + (qualification & EPT_EXEC_VIOLATION) ? 'x' : '-', + (qualification & EPT_EFFECTIVE_READ) ? 'r' : '-', + (qualification & EPT_EFFECTIVE_WRITE) ? 'w' : '-', + (qualification & EPT_EFFECTIVE_EXEC) ? 'x' : '-', + gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp); + + ept_walk_table(current->domain, gfn); + + if ( qualification & EPT_GLA_VALID ) + { + gla = __vmread(GUEST_LINEAR_ADDRESS); + printk(XENLOG_G_ERR " --- GLA %#lx\n", gla); + } + hvm_inject_hw_exception(TRAP_gp_fault, 0); + return 0; +} + +/* + * Main vm exit handler for PVH . Called from vmx_vmexit_handler(). + * Note: vmx_asm_vmexit_handler updates rip/rsp/eflags in regs{} struct. 
+ */ void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs) { + unsigned long exit_qualification; + unsigned int exit_reason = __vmread(VM_EXIT_REASON); + int rc=0, ccpu = smp_processor_id(); + struct vcpu *v = current; + + dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx EFLAGS:%lx CR0:%lx\n", + ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags, + __vmread(GUEST_CR0)); + + switch ( (uint16_t)exit_reason ) + { + /* NMI and machine_check are handled by the caller, we handle rest here */ + case EXIT_REASON_EXCEPTION_NMI: /* 0 */ + rc = vmxit_exception(regs); + break; + + case EXIT_REASON_EXTERNAL_INTERRUPT: /* 1 */ + break; /* handled in vmx_vmexit_handler() */ + + case EXIT_REASON_PENDING_VIRT_INTR: /* 7 */ + /* Disable the interrupt window. */ + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING; + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control); + break; + + case EXIT_REASON_CPUID: /* 10 */ + pv_cpuid(regs); + vmx_update_guest_eip(); + break; + + case EXIT_REASON_HLT: /* 12 */ + vmx_update_guest_eip(); + hvm_hlt(regs->eflags); + break; + + case EXIT_REASON_VMCALL: /* 18 */ + rc = vmxit_vmcall(regs); + break; + + case EXIT_REASON_CR_ACCESS: /* 28 */ + rc = vmxit_cr_access(regs); + break; + + case EXIT_REASON_DR_ACCESS: /* 29 */ + exit_qualification = __vmread(EXIT_QUALIFICATION); + vmx_dr_access(exit_qualification, regs); + break; + + case EXIT_REASON_IO_INSTRUCTION: /* 30 */ + vmxit_io_instr(regs); + break; + + case EXIT_REASON_MSR_READ: /* 31 */ + rc = vmxit_msr_read(regs); + break; + + case EXIT_REASON_MSR_WRITE: /* 32 */ + rc = vmxit_msr_write(regs); + break; + + case EXIT_REASON_MONITOR_TRAP_FLAG: /* 37 */ + rc = vmxit_mtf(regs); + break; + + case EXIT_REASON_MCE_DURING_VMENTRY: /* 41 */ + break; /* handled in vmx_vmexit_handler() */ + + case EXIT_REASON_EPT_VIOLATION: /* 48 */ + { + paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS); + exit_qualification = __vmread(EXIT_QUALIFICATION); + rc = 
pvh_ept_handle_violation(exit_qualification, gpa, regs); + break; + } + + default: + rc = 1; + printk(XENLOG_G_ERR + "PVH: Unexpected exit reason:%#x\n", exit_reason); + } + + if ( rc ) + { + exit_qualification = __vmread(EXIT_QUALIFICATION); + printk(XENLOG_G_WARNING + "PVH: [%d] exit_reas:%d %#x qual:%ld 0x%lx cr0:0x%016lx\n", + ccpu, exit_reason, exit_reason, exit_qualification, + exit_qualification, __vmread(GUEST_CR0)); + printk(XENLOG_G_WARNING "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n", + regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3)); + domain_crash(v->domain); + } } /* -- 1.7.2.3
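The CR0 write path in access_cr0() applies a few checks before the value reaches the VMCS: the upper 32 bits must be clear, ET is forced on, and PE and PG must both stay set. The following is a simplified, standalone sketch of just those checks. It is not the patch's code: the function name is hypothetical, it returns -1 where the patch returns -EPERM, and it omits the HVM_CR0_GUEST_RESERVED_BITS masking, the TS handling, and the VMCS writes.

```c
#include <stdint.h>

/* Architectural CR0 bit positions. */
#define X86_CR0_PE (1ULL << 0)    /* protection enable */
#define X86_CR0_ET (1ULL << 4)    /* extension type, expected to be 1 */
#define X86_CR0_PG (1ULL << 31)   /* paging */

/*
 * Hypothetical helper: returns 0 and stores the adjusted value in *out
 * when the write is acceptable, -1 otherwise.
 */
static int sketch_check_cr0_write(uint64_t new_cr0, uint64_t *out)
{
    /* The upper 32 bits of CR0 are reserved and must be zero. */
    if ( (uint32_t)new_cr0 != new_cr0 )
        return -1;

    /* ET is forced on, as in the patch. */
    new_cr0 |= X86_CR0_ET;

    /* Both PE and PG must remain set: no real mode, no paging off. */
    if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PE | X86_CR0_PG) )
        return -1;

    *out = new_cr0;
    return 0;
}
```

For example, a write of 0x80000001 (PE and PG only) would be accepted and come back as 0x80000011, while a write that clears PG would be refused.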
Keir Fraser
2013-Jul-24 06:21 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On 24/07/2013 02:59, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> Hi Keir, > > These V10 patches are in pretty good shape. I've addressed all the > issues Jan had in previous versions, and jfyi, he and I've been back > and forth on pretty much every patch in this series. Lot of the patches > have 'acked' or 'reviewed' tags. Kindly review.
These need to get in the tree now, or they're going to miss yet another cycle. Hasn't it been two/three years?
Acked-by: Keir Fraser <keir@xen.org>
> Christoph: > I've made the minor changes you suggested in V9, please review > patches 20 and 21. > > New in V10: minor changes in 20/21 to not call vmx create and destroy > functions, as they are noop for pvh. Also, in patch 16 > add check to not migrage hvm timers for PVH. > > To repeat from before, these are xen changes to support > boot of a 64bit PVH domU guest. Built on top of unstable git c/s: > 704302ce9404c73cfb687d31adcf67094ab5bb53 > > The public git tree for this: > git clone -n git://oss.oracle.com/git/mrathor/xen.git . > git checkout pvh.v10 > > Coming in future after this is done, two patchsets: > - 1) tools changes and 2) dom0 changes. > > Thanks for all the help, > Mukesh >
Andrew Cooper
2013-Jul-24 11:29 UTC
Re: [V10 PATCH 04/23] PVH xen: Move e820 fields out of pv_domain struct
On 24/07/13 02:59, Mukesh Rathor wrote:
> This patch moves fields out of the pv_domain struct as they are used by > PVH also. > > Changes in V6: > - Don't base on guest type the initialization and cleanup. > > Changes in V7: > - If statement doesn't need to be split across lines anymore. > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Reviewed-by: Jan Beulich <jbeulich@suse.com> > --- > xen/arch/x86/domain.c | 10 ++++------ > xen/arch/x86/mm.c | 26 ++++++++++++-------------- > xen/include/asm-x86/domain.h | 10 +++++----- > 3 files changed, 21 insertions(+), 25 deletions(-) > > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c > index 5de5e49..c361abf 100644 > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > if ( (rc = iommu_domain_init(d)) != 0 ) > goto fail; > } > + spin_lock_init(&d->arch.e820_lock); > > if ( is_hvm_domain(d) ) > { > @@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) > } > } > else > - { > /* 64-bit PV guest by default. */ > d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0; > > - spin_lock_init(&d->arch.pv_domain.e820_lock); > - } > - > /* initialize default tsc behavior in case tools don't */ > tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0); > spin_lock_init(&d->arch.vtsc_lock); > @@ -592,8 +589,9 @@ void arch_domain_destroy(struct domain *d) > { > if ( is_hvm_domain(d) ) > hvm_domain_destroy(d); > - else > - xfree(d->arch.pv_domain.e820); > + > + if ( d->arch.e820 ) > + xfree(d->arch.e820);
xfree() works correctly wrt null pointers; remove the if(). You appear to have used it without the check in the next hunk.
~Andrew
> > free_domain_pirqs(d); > if ( !is_idle_domain(d) ) > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index c00841c..412971e 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4763,11 +4763,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) > return -EFAULT; > } > > - spin_lock(&d->arch.pv_domain.e820_lock); > - xfree(d->arch.pv_domain.e820); > - d->arch.pv_domain.e820 = e820; > - d->arch.pv_domain.nr_e820 = fmap.map.nr_entries; > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > + xfree(d->arch.e820); > + d->arch.e820 = e820; > + d->arch.nr_e820 = fmap.map.nr_entries; > + spin_unlock(&d->arch.e820_lock); > > rcu_unlock_domain(d); > return rc; > @@ -4781,26 +4781,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) > if ( copy_from_guest(&map, arg, 1) ) > return -EFAULT; > > - spin_lock(&d->arch.pv_domain.e820_lock); > + spin_lock(&d->arch.e820_lock); > > /* Backwards compatibility. */ > - if ( (d->arch.pv_domain.nr_e820 == 0) || > - (d->arch.pv_domain.e820 == NULL) ) > + if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -ENOSYS; > } > > - map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820); > - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, > - map.nr_entries) || > + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); > + if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) || > __copy_to_guest(arg, &map, 1) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -EFAULT; > } > > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return 0; > } > > diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h > index d79464d..c3f9f8e 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -234,11 +234,6 @@ struct pv_domain > >
/* map_domain_page() mapping cache. */ > struct mapcache_domain mapcache; > - > - /* Pseudophysical e820 map (XENMEM_memory_map). */ > - spinlock_t e820_lock; > - struct e820entry *e820; > - unsigned int nr_e820; > }; > > struct arch_domain > @@ -313,6 +308,11 @@ struct arch_domain > (possibly other cases in the future */ > uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ > uint64_t vtsc_usercount; /* not used for hvm */ > + > + /* Pseudophysical e820 map (XENMEM_memory_map). */ > + spinlock_t e820_lock; > + struct e820entry *e820; > + unsigned int nr_e820; > } __cacheline_aligned; > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list))
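Andrew's review point is that the NULL check before xfree() is redundant. The same idiom exists in standard C, where free(NULL) is defined to do nothing, so a user-space analogue can illustrate it. This is only a sketch: the sketch_* names and the struct layout are hypothetical, and free()/calloc() stand in for Xen's allocator.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the e820 fields moved by the patch. */
struct sketch_e820entry {
    unsigned long long addr;
    unsigned long long size;
    unsigned int type;
};

struct sketch_arch_domain {
    struct sketch_e820entry *e820;
    unsigned int nr_e820;
};

/* Install a new map of nr entries, releasing any previous one. */
static int sketch_set_e820(struct sketch_arch_domain *d, unsigned int nr)
{
    struct sketch_e820entry *map = calloc(nr, sizeof(*map));

    if ( map == NULL )
        return -1;

    free(d->e820);      /* No check needed: free(NULL) is a no-op. */
    d->e820 = map;
    d->nr_e820 = nr;
    return 0;
}

/* Teardown path: again, no "if ( d->e820 )" guard is required. */
static void sketch_destroy(struct sketch_arch_domain *d)
{
    free(d->e820);
    d->e820 = NULL;
    d->nr_e820 = 0;
}
```

Dropping the guard keeps the destroy path consistent with the replace path, which already frees the old pointer unconditionally.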
Andrew Cooper
2013-Jul-24 12:24 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On 24/07/13 07:21, Keir Fraser wrote:
> On 24/07/2013 02:59, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote: > >> Hi Keir, >> >> These V10 patches are in pretty good shape. I've addressed all the >> issues Jan had in previous versions, and jfyi, he and I've been back >> and forth on pretty much every patch in this series. Lot of the patches >> have 'acked' or 'reviewed' tags. Kindly review. > These need to get in the tree now, or they're going to miss yet another > cycle. Hasn't it been two/three years? > > Acked-by: Keir Fraser <keir@xen.org>
Other than my minor nit in patch 4,
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
If possible, I will see about putting these patches in and running a standard set of regression tests. Given the extent of changes to both regular HVM and PV guests, it would be nice to know if there are any obvious problems caused by the introduction of PVH.
~Andrew
> >> Christoph: >> I've made the minor changes you suggested in V9, please review >> patches 20 and 21. >> >> New in V10: minor changes in 20/21 to not call vmx create and destroy >> functions, as they are noop for pvh. Also, in patch 16 >> add check to not migrage hvm timers for PVH. >> >> To repeat from before, these are xen changes to support >> boot of a 64bit PVH domU guest. Built on top of unstable git c/s: >> 704302ce9404c73cfb687d31adcf67094ab5bb53 >> >> The public git tree for this: >> git clone -n git://oss.oracle.com/git/mrathor/xen.git . >> git checkout pvh.v10 >> >> Coming in future after this is done, two patchsets: >> - 1) tools changes and 2) dom0 changes. >> >> Thanks for all the help, >> Mukesh
Konrad Rzeszutek Wilk
2013-Jul-24 15:04 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On Wed, Jul 24, 2013 at 01:24:32PM +0100, Andrew Cooper wrote:
> On 24/07/13 07:21, Keir Fraser wrote: > > On 24/07/2013 02:59, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote: > > > >> Hi Keir, > >> > >> These V10 patches are in pretty good shape. I've addressed all the > >> issues Jan had in previous versions, and jfyi, he and I've been back > >> and forth on pretty much every patch in this series. Lot of the patches > >> have 'acked' or 'reviewed' tags. Kindly review. > > These need to get in the tree now, or they're going to miss yet another > > cycle. Hasn't it been two/three years? > > > > Acked-by: Keir Fraser <keir@xen.org> > > Other than my minor nit in patch 4, > > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> > > If possible, I will see about putting these patches in and running a > standard set of regression tests. Given the extent of changes to both
Yeeey!
> regular HVM and PV guests, it would be nice to know if there are any > obvious problems caused by the introduction of PVH.
Absolutely. That is why I am glad that Mukesh split them up to make these the 'prep' patches, so if there are regressions we can easily identify them now. Thanks!
Keir Fraser
2013-Jul-24 20:25 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On 24/07/2013 13:24, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>>> These V10 patches are in pretty good shape. I've addressed all the >>> issues Jan had in previous versions, and jfyi, he and I've been back >>> and forth on pretty much every patch in this series. Lot of the patches >>> have 'acked' or 'reviewed' tags. Kindly review. >> These need to get in the tree now, or they're going to miss yet another >> cycle. Hasn't it been two/three years? >> >> Acked-by: Keir Fraser <keir@xen.org> > > Other than my minor nit in patch 4, > > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> > > If possible, I will see about putting these patches in and running a > standard set of regression tests. Given the extent of changes to both > regular HVM and PV guests, it would be nice to know if there are any > obvious problems caused by the introduction of PVH.
That would be most excellent, thank you!
-- Keir
Mukesh Rathor
2013-Jul-25 01:07 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On Wed, 24 Jul 2013 13:24:32 +0100 Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 24/07/13 07:21, Keir Fraser wrote: > > On 24/07/2013 02:59, "Mukesh Rathor" <mukesh.rathor@oracle.com> > > wrote: > > > >> Hi Keir, > >> > >> These V10 patches are in pretty good shape. I've addressed all the > >> issues Jan had in previous versions, and jfyi, he and I've been > >> back and forth on pretty much every patch in this series. Lot of > >> the patches have 'acked' or 'reviewed' tags. Kindly review. > > These need to get in the tree now, or they're going to miss yet > > another cycle. Hasn't it been two/three years? > > > > Acked-by: Keir Fraser <keir@xen.org>
Thanks Keir! About time it got in :).
> > Other than my minor nit in patch 4, > > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> > > If possible, I will see about putting these patches in and running a > standard set of regression tests. Given the extent of changes to both > regular HVM and PV guests, it would be nice to know if there are any > obvious problems caused by the introduction of PVH. > > ~Andrew
Thanks Andrew for running the regressions, please lmk. I've pushed a new branch pvh.v10.acked, with the nit fixed in patch 4 and the acks, at:
git clone -n git://oss.oracle.com/git/mrathor/xen.git .
git checkout pvh.v10.acked
Look forward to seeing it in xen unstable tree.
thanks again, Mukesh
> > > >> Christoph: > >> I've made the minor changes you suggested in V9, please review > >> patches 20 and 21. > >> > >> New in V10: minor changes in 20/21 to not call vmx create and > >> destroy functions, as they are noop for pvh. Also, in patch 16 > >> add check to not migrage hvm timers for PVH. > >> > >> To repeat from before, these are xen changes to support > >> boot of a 64bit PVH domU guest. Built on top of unstable git c/s: > >> 704302ce9404c73cfb687d31adcf67094ab5bb53 > >> > >> The public git tree for this: > >> git clone -n git://oss.oracle.com/git/mrathor/xen.git .
> >> git checkout pvh.v10 > >> > >> Coming in future after this is done, two patchsets: > >> - 1) tools changes and 2) dom0 changes. > >> > >> Thanks for all the help, > >> Mukesh > >> > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel
Tim Deegan
2013-Jul-25 13:47 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> +/*
> + * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called
> + * from arch_set_info_guest() which sets the (PVH relevant) non-vmcs fields.
> + *
> + * In case of linux:
> + *     The boot vcpu calls this to set some context for the non boot smp vcpu.
> + *     The call comes from cpu_initialize_context(). (boot vcpu 0 context is
> + *     set by the tools via do_domctl -> vcpu_initialise).
> + *
> + * NOTE: In case of VMCS, loading a selector doesn't cause the hidden fields
> + *       to be automatically loaded. We load selectors here but not the hidden
> + *       parts, except for GS_BASE and FS_BASE. This means we require the
> + *       guest to have same hidden values as the default values loaded in the
> + *       vmcs in pvh_construct_vmcs(), ie, the GDT the vcpu is coming up on
> + *       should be something like following,
> + *       (from 64bit linux, CS:0x10 DS/SS:0x18) :
> + *
> + *           ffff88007f704000: 0000000000000000 00cf9b000000ffff
> + *           ffff88007f704010: 00af9b000000ffff 00cf93000000ffff
> + *           ffff88007f704020: 00cffb000000ffff 00cff3000000ffff

What a bizarre interface! Why does this operation not load the hidden
parts from the GDT/LDT that the caller supplied?

If this _is_ to be the interface:
 - this comment should be somewhere in the interface headers (and docs)
   so that OS authors can find it; and
 - it should specify the _exact_ constraints that the guest must follow
   in constructing its tables.

Tim.
Tim Deegan
2013-Jul-25 14:01 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
At 18:59 -0700 on 23 Jul (1374605958), Mukesh Rathor wrote:
> @@ -861,7 +865,7 @@ int arch_set_info_guest(
>
>      if ( !cr3_page )
>          rc = -EINVAL;
> -    else if ( paging_mode_refcounts(d) )
> +    else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) )

Isn't paging_mode_refcounts() true for all PVH vcpus/domains? (This
code is OK - I just want to check that I've understood what's going on.)

>          /* nothing */;
>      else if ( cr3_page == v->arch.old_guest_table )
>      {
> @@ -893,8 +897,15 @@ int arch_set_info_guest(
>          /* handled below */;
>      else if ( !compat )
>      {
> +        /* PVH 32bitfixme. */
> +        if ( is_pvh_vcpu(v) )
> +        {
> +            v->arch.cr3 = page_to_mfn(cr3_page);

Not page_to_maddr()? I guess that with paging_mode_translate(),
arch.cr3 doesn't actually get used.

> +            v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
> +        }
> +
>          v->arch.guest_table = pagetable_from_page(cr3_page);
> -        if ( c.nat->ctrlreg[1] )
> +        if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )

If you're respinning this patch, maybe use is_pv_vcpu() here?

Tim.
Tim Deegan
2013-Jul-25 16:28 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote:
> +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> +{
> +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);

Was this discussed before? It seems harsh to stop kernel-mode code from
using the pv cpuid operation if it wants to. In particular, what about
loadable kernel modules?

If you do go with this restriction, please document it in
include/public/arch-x86/xen.h beside the XEN_CPUID definition.

> +/* Returns: rc == 0: success. */
> +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    struct vcpu *vp = current;
> +
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        unsigned long new_cr0 = *regp;
> +        unsigned long old_cr0 = __vmread(GUEST_CR0);
> +
> +        dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);
> +        if ( (u32)new_cr0 != new_cr0 )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "Guest setting upper 32 bits in CR0: %lx", new_cr0);
> +            return -EPERM;

AFAICS returning non-zero here crashes the guest. Shouldn't this inject
#GP instead?

> +        }
> +
> +        new_cr0 &= ~HVM_CR0_GUEST_RESERVED_BITS;
> +        /* ET is reserved and should always be 1. */
> +        new_cr0 |= X86_CR0_ET;
> +
> +        /* A pvh is not expected to change to real mode. */
> +        if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) !=
> +             (X86_CR0_PG | X86_CR0_PE) )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "PVH attempting to turn off PE/PG. CR0:%lx\n", new_cr0);
> +            return -EPERM;

This I guess is more reasonable since the paging-mode restriction is
part of the PVH ABI.

> +        }
> +        /* TS going from 1 to 0 */
> +        if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) )
> +            vmx_fpu_enter(vp);
> +
> +        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0;
> +        __vmwrite(GUEST_CR0, new_cr0);
> +        __vmwrite(CR0_READ_SHADOW, new_cr0);
> +    }
> +    else
> +        *regp = __vmread(GUEST_CR0);
> +
> +    return 0;
> +}
> +
> +/* Returns: rc == 0: success. */
> +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        struct vcpu *vp = current;
> +        u64 old_val = __vmread(GUEST_CR4);
> +        u64 new = *regp;
> +
> +        if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "PVH guest attempts to set reserved bit in CR4: %lx", new);
> +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> +            return 0;
> +        }
> +
> +        if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) )
> +        {
> +            printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while "
> +                   "EFER.LMA is set");
> +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> +            return 0;
> +        }
> +
> +        vp->arch.hvm_vcpu.guest_cr[4] = new;
> +
> +        if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> +            vpid_sync_all();
> +
> +        __vmwrite(CR4_READ_SHADOW, new);
> +
> +        new &= ~X86_CR4_PAE;         /* PVH always runs with hap enabled. */

The equivalent mask in vmx_update_guest_cr() is masking out a default
setting of CR4.PAE _before_ the guest's requested bits get ORred in.
This is masking out the PAE bit that we just insisted on. I'm surprised
that VMENTER doesn't choke on this -- I guess it uses
VM_ENTRY_IA32E_MODE rather than looking at these bits at all.

> +        new |= X86_CR4_VMXE | X86_CR4_MCE;
> +        __vmwrite(GUEST_CR4, new);
> +    }
> +    else
> +        *regp = __vmread(CR4_READ_SHADOW);
> +
> +    return 0;
> +}
> +
> +/* Returns: rc == 0: success, else -errno. */
> +static int vmxit_cr_access(struct cpu_user_regs *regs)
> +{
> +    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> +    uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification);
> +    int cr, rc = -EINVAL;
> +
> +    switch ( acc_typ )
> +    {
> +    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR:
> +    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR:
> +    {
> +        uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification);
> +        uint64_t *regp = decode_register(gpr, regs, 0);
> +        cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification);
> +
> +        if ( regp == NULL )
> +            break;
> +
> +        switch ( cr )
> +        {
> +        case 0:
> +            rc = access_cr0(regs, acc_typ, regp);
> +            break;
> +
> +        case 3:
> +            printk(XENLOG_G_ERR "PVH: unexpected cr3 vmexit. rip:%lx\n",
> +                   regs->rip);
> +            domain_crash(current->domain);
> +            break;
> +
> +        case 4:
> +            rc = access_cr4(regs, acc_typ, regp);
> +            break;
> +        }
> +        if ( rc == 0 )
> +            vmx_update_guest_eip();
> +        break;
> +    }
> +
> +    case VMX_CONTROL_REG_ACCESS_TYPE_CLTS:
> +    {
> +        struct vcpu *vp = current;
> +        unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS;
> +        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0;
> +
> +        vmx_fpu_enter(vp);
> +        __vmwrite(GUEST_CR0, cr0);
> +        __vmwrite(CR0_READ_SHADOW, cr0);
> +        vmx_update_guest_eip();
> +        rc = 0;
> +    }
> +    }

No "case VMX_CONTROL_REG_ACCESS_TYPE_LMSW"?

Tim.
Tim Deegan
2013-Jul-25 16:39 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
Hi,

At 18:59 -0700 on 23 Jul (1374605948), Mukesh Rathor wrote:
> These V10 patches are in pretty good shape. I've addressed all the
> issues Jan had in previous versions, and jfyi, he and I've been back
> and forth on pretty much every patch in this series. Lot of the patches
> have 'acked' or 'reviewed' tags. Kindly review.

If these aren't already committed, you can add my Reviewed-by to all
except 9, 10, 19 and 23.

For #10 if you just s/page_to_mfn/page_to_maddr/ that's good enough for
me, and it can have my Reviewed-by as well.

#9 and #23 I've commented on separately.

#19 is probably OK for correctness, though I haven't reviewed the VMCS
settings in enough detail to be sure they're complete. My main
reservation is that it seems to duplicate a bunch of code from the HVM
VMCS setup.

One question that's not from any particular patch: is there a check
anywhere to stop the tools creating a PVH domain on an AMD machine? I
didn't see one but may just have missed it.

Cheers,

Tim.
Mukesh Rathor
2013-Jul-26 00:58 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Thu, 25 Jul 2013 14:47:55 +0100
Tim Deegan <tim@xen.org> wrote:

> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
....
> > + * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called
> > require the
> > + *       guest to have same hidden values as the default values loaded in the
> > + *       vmcs in pvh_construct_vmcs(), ie, the GDT the vcpu is coming up on
> > + *       should be something like following,
> > + *       (from 64bit linux, CS:0x10 DS/SS:0x18) :
> > + *
> > + *           ffff88007f704000: 0000000000000000 00cf9b000000ffff
> > + *           ffff88007f704010: 00af9b000000ffff 00cf93000000ffff
> > + *           ffff88007f704020: 00cffb000000ffff 00cff3000000ffff
>
> What a bizarre interface! Why does this operation not load the hidden
> parts from the GDT/LDT that the caller supplied?

Pl see: http://lists.xen.org/archives/html/xen-devel/2013-07/msg01489.html

That was an option discussed with Jan, walking and reading the GDT
entries from the gdtaddr the guest provided to load the hidden
parts. But, I agree with him, that for the initial cpu boot we can
restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
accessed" default attributes for the hidden part (64bit guest).

> If this _is_ to be the interface:
>  - this comment should be somewhere in the interface headers (and
>    docs) so that OS authors can find it; and
>  - it should specify the _exact_ constraints that the guest must
>    follow in constructing its tables.

I can add a comment to VCPUOP_initialise:

diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index e888daf..03baae2 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -43,6 +43,8 @@
  *
  * @extra_arg == pointer to vcpu_guest_context structure containing initial
  *               state for the VCPU.
+ *
+ * PVH: for constraints, please see prolog of vmx_pvh_set_vcpu_info().
  */
 #define VCPUOP_initialise            0

Or I can explicitly add the details here:

  * @extra_arg == pointer to vcpu_guest_context structure containing initial
  *               state for the VCPU.
+ *
+ * PVH constraints: The vcpu GDT must adhere to following default values.
+ *   For 64 bit guest:
+ *     CS: base:0 limit:fffff Type:0xb DPL:0 P:1 AVL:0 L:1 D:0 G:1
+ *         ie, GDT entry: 00af9b000000ffff
+ *     Others: base: 0 limit:fffff Type:3 DPL:0 P:1 AVL:0 L:0 B:1 G:1
+ *         i,e GDT entry: 00cf93000000ffff
+ *     (except GS_BASE and FS_BASE that are guest provided).
  */
 #define VCPUOP_initialise            0

Please lmk which is better.

I can also redo the function prolog to explicitly provide entry details:

/*
 * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called
 * from arch_set_info_guest() which sets the (PVH relevant) non-vmcs fields.
 *
 * (NOTE: In case of VMCS, loading a selector doesn't cause the hidden fields
 *        to be automatically loaded.)
 *
 * Guest Constraints:
 *   We load selectors here but not the hidden parts, except for
 *   GS_BASE and FS_BASE. This means we require the guest to have same
 *   hidden values as the default values loaded in the vmcs (in
 *   pvh_construct_vmcs()).
 *   For 64 bit guest:
 *     CS: base:0 limit:fffff Type:0xb DPL:0 P:1 AVL:0 L:1 D:0 G:1
 *         ie, GDT entry: 00af9b000000ffff
 *     Others: base: 0 limit:fffff Type:3 DPL:0 P:1 AVL:0 L:0 B:1 G:1
 *         i,e GDT entry: 00cf93000000ffff
 *
 * In case of linux:
 *   The boot vcpu calls this to set context for the non boot smp vcpu. The
 *   call comes from cpu_initialize_context(). (boot vcpu 0 context is
 *   set by the tools via do_domctl -> vcpu_initialise).
 *   For 64bit, GDT for bringup vcpu looks like (CS:0x10 DS/SS:0x18) :
 *
 *     ffff88007f704000: 0000000000000000 00cf9b000000ffff
 *     ffff88007f704010: 00af9b000000ffff 00cf93000000ffff
 *     ffff88007f704020: 00cffb000000ffff 00cff3000000ffff
 *
 */
int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)

Hope that helps.

thanks
mukesh
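The two GDT entries quoted in the proposed comment can be checked by hand against the field list given beside them. The following is a minimal sketch, assuming the standard 8-byte x86 segment-descriptor layout; `struct seg_desc` and `decode_desc()` are hypothetical names used only for illustration and are not part of the patch series.

```c
#include <stdint.h>

/* Decoded view of a legacy 8-byte segment descriptor (illustrative type). */
struct seg_desc {
    uint32_t base;   /* 32-bit base address */
    uint32_t limit;  /* raw 20-bit limit, before granularity scaling */
    unsigned type;   /* bits 40-43 */
    unsigned s;      /* bit 44: 1 = code/data, 0 = system */
    unsigned dpl;    /* bits 45-46 */
    unsigned p;      /* bit 47: present */
    unsigned avl;    /* bit 52 */
    unsigned l;      /* bit 53: 64-bit code segment */
    unsigned db;     /* bit 54: D (code) or B (data) */
    unsigned g;      /* bit 55: granularity */
};

/* Split a raw descriptor into its architectural fields. */
static struct seg_desc decode_desc(uint64_t d)
{
    struct seg_desc s;

    s.limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 32) & 0xf0000);
    s.base  = (uint32_t)((d >> 16) & 0xffffff) |
              (uint32_t)((d >> 32) & 0xff000000);
    s.type  = (unsigned)((d >> 40) & 0xf);
    s.s     = (unsigned)((d >> 44) & 1);
    s.dpl   = (unsigned)((d >> 45) & 3);
    s.p     = (unsigned)((d >> 47) & 1);
    s.avl   = (unsigned)((d >> 52) & 1);
    s.l     = (unsigned)((d >> 53) & 1);
    s.db    = (unsigned)((d >> 54) & 1);
    s.g     = (unsigned)((d >> 55) & 1);
    return s;
}
```

Decoding 00af9b000000ffff this way gives base 0, limit fffff, type 0xb, DPL 0, P 1, AVL 0, L 1, D 0, G 1, and 00cf93000000ffff gives type 3 with L 0 and B 1, which agrees with the values listed in the message.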
Mukesh Rathor
2013-Jul-26 01:02 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Thu, 25 Jul 2013 15:01:57 +0100
Tim Deegan <tim@xen.org> wrote:

> At 18:59 -0700 on 23 Jul (1374605958), Mukesh Rathor wrote:
> > @@ -861,7 +865,7 @@ int arch_set_info_guest(
> >
> >      if ( !cr3_page )
> >          rc = -EINVAL;
> > -    else if ( paging_mode_refcounts(d) )
> > +    else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) )
>
> Isn't paging_mode_refcounts() true for all PVH vcpus/domains? (This
> code is OK - I just want to check that I've understood what's going
> on.)

Yes it is.

> >          /* nothing */;
> >      else if ( cr3_page == v->arch.old_guest_table )
> >      {
> > @@ -893,8 +897,15 @@ int arch_set_info_guest(
> >          /* handled below */;
> >      else if ( !compat )
> >      {
> > +        /* PVH 32bitfixme. */
> > +        if ( is_pvh_vcpu(v) )
> > +        {
> > +            v->arch.cr3 = page_to_mfn(cr3_page);
>
> Not page_to_maddr()? I guess that with paging_mode_translate(),
> arch.cr3 doesn't actually get used.

Oops, yes, it should be page_to_maddr, and you just saved future
debugging.

> > +            v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
> > +        }
> > +
> >          v->arch.guest_table = pagetable_from_page(cr3_page);
> > -        if ( c.nat->ctrlreg[1] )
> > +        if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
>
> If you're respinning this patch, maybe use is_pv_vcpu() here?

Ok.

thanks
Mukesh
Mukesh Rathor
2013-Jul-26 02:30 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Thu, 25 Jul 2013 17:28:40 +0100
Tim Deegan <tim@xen.org> wrote:

> At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote:
> > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > +{
> > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
>
> Was this discussed before? It seems harsh to stop kernel-mode code
> from using the pv cpuid operation if it wants to. In particular,
> what about loadable kernel modules?

Yes, few times on the xen mailing list. The only PVH guest, linux
as of now, the pv ops got rewired to use native cpuid, which is
how hvm does it. So, couldn't come up with any real reason to support
it. The kernel modules in pv ops will go thru native_cpuid too, which
will do hvm cpuid too.

> If you do go with this restriction, please document it in
> include/public/arch-x86/xen.h beside the XEN_CPUID definition.

Ok, I'll add it there.

> > +/* Returns: rc == 0: success. */
> > +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> > +{
> > +    struct vcpu *vp = current;
> > +
> > +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> > +    {
> > +        unsigned long new_cr0 = *regp;
> > +        unsigned long old_cr0 = __vmread(GUEST_CR0);
> > +
> > +        dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);
> > +        if ( (u32)new_cr0 != new_cr0 )
> > +        {
> > +            printk(XENLOG_G_WARNING
> > +                   "Guest setting upper 32 bits in CR0: %lx", new_cr0);
> > +            return -EPERM;
>
> AFAICS returning non-zero here crashes the guest. Shouldn't this
> inject #GP instead?

Right, GPF it is.

.....
> > +
> > +        if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) )
> > +        {
> > +            printk(XENLOG_G_WARNING
> > +                   "PVH guest attempts to set reserved bit in CR4: %lx", new);
> > +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > +            return 0;
> > +        }
> > +
> > +        if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) )
> > +        {
> > +            printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while "
> > +                   "EFER.LMA is set");
> > +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > +            return 0;
> > +        }
> > +
> > +        vp->arch.hvm_vcpu.guest_cr[4] = new;
> > +
> > +        if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> > +            vpid_sync_all();
> > +
> > +        __vmwrite(CR4_READ_SHADOW, new);
> > +
> > +        new &= ~X86_CR4_PAE;         /* PVH always runs with hap enabled. */
>
> The equivalent mask in vmx_update_guest_cr() is masking out a default
> setting of CR4.PAE _before_ the guest's requested bits get ORred in.
> This is masking out the PAE bit that we just insisted on. I'm
> surprised that VMENTER doesn't choke on this -- I guess it uses
> VM_ENTRY_IA32E_MODE rather than looking at these bits at all.

Ah, I see. what a mess! Hmm... I can't find much in the VMX sections
of the SDM on this. Not sure what I should do here. How about just
not clearing the X86_CR4_PAE, so the GUEST_CR4 will have it
if it's set in new, ie, the guest wants it set? ie, :
...
        __vmwrite(CR4_READ_SHADOW, new);

        new |= X86_CR4_VMXE | X86_CR4_MCE;
        __vmwrite(GUEST_CR4, new);
...

> > +        new |= X86_CR4_VMXE | X86_CR4_MCE;
> > +        __vmwrite(GUEST_CR4, new);
> > +    }
> > +    else
> > +        *regp = __vmread(CR4_READ_SHADOW);
> > +
> > +    return 0;
> > +}
> > +
> > +/* Returns: rc == 0: success, else -errno. */
> > +static int vmxit_cr_access(struct cpu_user_regs *regs)
> > +{
> > +    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> > +    uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification);
> > +    int cr, rc = -EINVAL;
> > +
> > +    switch ( acc_typ )
> > +    {
> > +    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR:
> > +    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR:
> > +    {
> > +        uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification);
> > +        uint64_t *regp = decode_register(gpr, regs, 0);
> > +        cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification);
> > +
> > +        if ( regp == NULL )
> > +            break;
> > +
> > +        switch ( cr )
> > +        {
> > +        case 0:
> > +            rc = access_cr0(regs, acc_typ, regp);
> > +            break;
> > +
> > +        case 3:
> > +            printk(XENLOG_G_ERR "PVH: unexpected cr3 vmexit. rip:%lx\n",
> > +                   regs->rip);
> > +            domain_crash(current->domain);
> > +            break;
> > +
> > +        case 4:
> > +            rc = access_cr4(regs, acc_typ, regp);
> > +            break;
> > +        }
> > +        if ( rc == 0 )
> > +            vmx_update_guest_eip();
> > +        break;
> > +    }
> > +
> > +    case VMX_CONTROL_REG_ACCESS_TYPE_CLTS:
> > +    {
> > +        struct vcpu *vp = current;
> > +        unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS;
> > +        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0;
> > +
> > +        vmx_fpu_enter(vp);
> > +        __vmwrite(GUEST_CR0, cr0);
> > +        __vmwrite(CR0_READ_SHADOW, cr0);
> > +        vmx_update_guest_eip();
> > +        rc = 0;
> > +    }
> > +    }
>
> No "case VMX_CONTROL_REG_ACCESS_TYPE_LMSW"?

Well, PVH is such a new concept, do you really think we need it? Lets
not hold the series just for this, we can always add in future :).

Thanks,
Mukesh
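The CR4 change proposed in this message comes down to two values derived from the guest's write: the read shadow (what the guest sees when it reads CR4) and the GUEST_CR4 field (what the hardware runs with). A minimal sketch of that split follows, assuming only the architectural CR4 bit positions; `struct cr4_vmcs` and `pvh_cr4_values()` are hypothetical names for illustration, and the real code writes these values with `__vmwrite()` rather than returning them.

```c
#include <stdint.h>

/* Architectural CR4 bit positions. */
#define X86_CR4_PAE   (1ULL << 5)
#define X86_CR4_MCE   (1ULL << 6)
#define X86_CR4_VMXE  (1ULL << 13)

/* The two VMCS values derived from one guest CR4 write (illustrative). */
struct cr4_vmcs {
    uint64_t read_shadow; /* value the guest reads back */
    uint64_t guest_cr4;   /* value the hardware actually uses */
};

/*
 * Variant discussed in the thread: keep whatever PAE setting the guest
 * asked for, and only force on the bits the hypervisor needs.
 */
static struct cr4_vmcs pvh_cr4_values(uint64_t guest_requested)
{
    struct cr4_vmcs out;

    out.read_shadow = guest_requested;
    out.guest_cr4   = guest_requested | X86_CR4_VMXE | X86_CR4_MCE;
    return out;
}
```

With the `new &= ~X86_CR4_PAE` line removed, a guest value with PAE set keeps PAE in GUEST_CR4, which is the behaviour the earlier reserved-bit and long-mode checks already insist on.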
Tim Deegan
2013-Jul-26 10:29 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 17:58 -0700 on 25 Jul (1374775120), Mukesh Rathor wrote:
> Pl see: http://lists.xen.org/archives/html/xen-devel/2013-07/msg01489.html
>
> That was an option discussed with Jan, walking and reading the GDT
> entries from the gdtaddr the guest provided to load the hidden
> parts. But, I agree with him, that for the initial cpu boot we can
> restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> accessed" default attributes for the hidden part (64bit guest).

I see, thanks for the link.

> > If this _is_ to be the interface:
> >  - this comment should be somewhere in the interface headers (and
> >    docs) so that OS authors can find it; and
> >  - it should specify the _exact_ constraints that the guest must
> >    follow in constructing its tables.
>
> I can add a comment to VCPUOP_initialise:
>
> diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
> index e888daf..03baae2 100644
> --- a/xen/include/public/vcpu.h
> +++ b/xen/include/public/vcpu.h
> @@ -43,6 +43,8 @@
>   *
>   * @extra_arg == pointer to vcpu_guest_context structure containing initial
>   *               state for the VCPU.
> + *
> + * PVH: for constraints, please see prolog of vmx_pvh_set_vcpu_info().
>   */
>  #define VCPUOP_initialise            0
>
>
> Or I can explicitly add the details here:
>
>   * @extra_arg == pointer to vcpu_guest_context structure containing initial
>   *               state for the VCPU.
> + *
> + * PVH constraints: The vcpu GDT must adhere to following default values.
> + *   For 64 bit guest:
> + *     CS: base:0 limit:fffff Type:0xb DPL:0 P:1 AVL:0 L:1 D:0 G:1
> + *         ie, GDT entry: 00af9b000000ffff
> + *     Others: base: 0 limit:fffff Type:3 DPL:0 P:1 AVL:0 L:0 B:1 G:1
> + *         i,e GDT entry: 00cf93000000ffff
> + *     (except GS_BASE and FS_BASE that are guest provided).
>   */
>  #define VCPUOP_initialise            0
>
>
> Please lmk which is better.

The second, please. Having the details in the interface is much
friendlier.

> I can also redo the function prolog to explicitly provide entry details:
>
> /*
>  * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called
>  * from arch_set_info_guest() which sets the (PVH relevant) non-vmcs fields.
>  *
>  * (NOTE: In case of VMCS, loading a selector doesn't cause the hidden fields
>  *        to be automatically loaded.)
>  *
>  * Guest Constraints:
>  *   We load selectors here but not the hidden parts, except for
>  *   GS_BASE and FS_BASE. This means we require the guest to have same
>  *   hidden values as the default values loaded in the vmcs (in
>  *   pvh_construct_vmcs()).
>  *   For 64 bit guest:
>  *     CS: base:0 limit:fffff Type:0xb DPL:0 P:1 AVL:0 L:1 D:0 G:1
>  *         ie, GDT entry: 00af9b000000ffff
>  *     Others: base: 0 limit:fffff Type:3 DPL:0 P:1 AVL:0 L:0 B:1 G:1
>  *         i,e GDT entry: 00cf93000000ffff

Yes, that's helpful -- maybe a link back to the interface header here
too so if anyone changes it they'll know to change both.

>  * In case of linux:
>  *   The boot vcpu calls this to set context for the non boot smp vcpu. The
>  *   call comes from cpu_initialize_context(). (boot vcpu 0 context is
>  *   set by the tools via do_domctl -> vcpu_initialise).
>  *   For 64bit, GDT for bringup vcpu looks like (CS:0x10 DS/SS:0x18) :
>  *
>  *     ffff88007f704000: 0000000000000000 00cf9b000000ffff
>  *     ffff88007f704010: 00af9b000000ffff 00cf93000000ffff
>  *     ffff88007f704020: 00cffb000000ffff 00cff3000000ffff

I'm not sure that's as much use -- it certainly won't be kept in line
with any changes in the linux code.

Cheers,

Tim.
Tim Deegan
2013-Jul-26 10:45 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote:
> On Thu, 25 Jul 2013 17:28:40 +0100
> Tim Deegan <tim@xen.org> wrote:
>
> > At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote:
> > > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> > > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > > +{
> > > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> > > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> >
> > Was this discussed before? It seems harsh to stop kernel-mode code
> > from using the pv cpuid operation if it wants to. In particular,
> > what about loadable kernel modules?
>
> Yes, few times on the xen mailing list. The only PVH guest, linux
> as of now, the pv ops got rewired to use native cpuid, which is
> how hvm does it.

Yes, but presumably you want to make it easy for other PV guests to port
to PVH too?

> So, couldn't come up with any real reason to support it.

Seems like there's no reason not to -- wouldn't just removing the check
for kernel-mode DTRT? Or is there some other complication?

> The kernel modules in pv ops will go thru native_cpuid
> too, which will do hvm cpuid too.
>
>
> > If you do go with this restriction, please document it in
> > include/public/arch-x86/xen.h beside the XEN_CPUID definition.
>
> Ok, I'll add it there.

Thanks.

> > > +/* Returns: rc == 0: success. */
> > > +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> > > +{
> > > +    struct vcpu *vp = current;
> > > +
> > > +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> > > +    {
> > > +        unsigned long new_cr0 = *regp;
> > > +        unsigned long old_cr0 = __vmread(GUEST_CR0);
> > > +
> > > +        dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);
> > > +        if ( (u32)new_cr0 != new_cr0 )
> > > +        {
> > > +            printk(XENLOG_G_WARNING
> > > +                   "Guest setting upper 32 bits in CR0: %lx", new_cr0);
> > > +            return -EPERM;
> >
> > AFAICS returning non-zero here crashes the guest. Shouldn't this
> > inject #GP instead?
>
> Right, GPF it is.

Ta.

> > > +
> > > +        if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) )
> > > +        {
> > > +            printk(XENLOG_G_WARNING
> > > +                   "PVH guest attempts to set reserved bit in CR4: %lx", new);
> > > +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > > +            return 0;
> > > +        }
> > > +
> > > +        if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) )
> > > +        {
> > > +            printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while "
> > > +                   "EFER.LMA is set");
> > > +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > > +            return 0;
> > > +        }
> > > +
> > > +        vp->arch.hvm_vcpu.guest_cr[4] = new;
> > > +
> > > +        if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> > > +            vpid_sync_all();
> > > +
> > > +        __vmwrite(CR4_READ_SHADOW, new);
> > > +
> > > +        new &= ~X86_CR4_PAE;         /* PVH always runs with hap enabled. */
> >
> > The equivalent mask in vmx_update_guest_cr() is masking out a default
> > setting of CR4.PAE _before_ the guest's requested bits get ORred in.
> > This is masking out the PAE bit that we just insisted on. I'm
> > surprised that VMENTER doesn't choke on this -- I guess it uses
> > VM_ENTRY_IA32E_MODE rather than looking at these bits at all.
>
> Ah, I see. what a mess! Hmm... I can't find much in the VMX sections
> of the SDM on this. Not sure what I should do here. How about just
> not clearing the X86_CR4_PAE, so the GUEST_CR4 will have it
> if it's set in new, ie, the guest wants it set? ie, :
> ...
>         __vmwrite(CR4_READ_SHADOW, new);
>
>         new |= X86_CR4_VMXE | X86_CR4_MCE;
>         __vmwrite(GUEST_CR4, new);

Yep, that looks correct to me.

> > > +    case VMX_CONTROL_REG_ACCESS_TYPE_CLTS:
> > > +    {
> > > +        struct vcpu *vp = current;
> > > +        unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS;
> > > +        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0;
> > > +
> > > +        vmx_fpu_enter(vp);
> > > +        __vmwrite(GUEST_CR0, cr0);
> > > +        __vmwrite(CR0_READ_SHADOW, cr0);
> > > +        vmx_update_guest_eip();
> > > +        rc = 0;
> > > +    }
> > > +    }
> >
> > No "case VMX_CONTROL_REG_ACCESS_TYPE_LMSW"?
>
> Well, PVH is such a new concept, do you really think we need it?

Yes! I've seen it used (in an implementation of stts() IIRC) quite
recently. Besides, it's only about three lines of code to lift from the
normal VMX handler.

> Lets
> not hold the series just for this, we can always add in future :).

Well, I assume with Keir's ack the series will go in anyway and any
changes from this review would be fixed up afterwards.

Tim.
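For reference, the LMSW case asked about here is mostly a small merge of the instruction's source operand into the existing CR0 value. The sketch below is written from the architectural definition of LMSW (it loads only CR0 bits 0-3 and cannot clear PE) and from the VMX exit-qualification layout, where bits 31:16 carry the LMSW source data. The helper names `lmsw_source()` and `lmsw_merge_cr0()` are hypothetical; this is not the code from the normal VMX handler, which would go on to validate and apply the merged value.

```c
#include <stdint.h>

/* LMSW source operand: bits 31:16 of the CR-access exit qualification. */
static uint16_t lmsw_source(uint64_t exit_qualification)
{
    return (uint16_t)(exit_qualification >> 16);
}

/*
 * Merge an LMSW operand into the current CR0.
 * LMSW can set PE/MP/EM/TS (bits 0-3) but can only clear MP/EM/TS
 * (bits 1-3): once PE is set it stays set.
 */
static uint64_t lmsw_merge_cr0(uint64_t old_cr0, uint16_t msw)
{
    return (old_cr0 & ~0xeULL) | (uint64_t)(msw & 0xf);
}
```

An stts()-style sequence that reads the machine status word, sets TS and writes it back with LMSW would therefore change only CR0.TS, which is the bit the CLTS case above clears.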
Mukesh Rathor
2013-Jul-26 18:55 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On Thu, 25 Jul 2013 17:39:28 +0100
Tim Deegan <tim@xen.org> wrote:

> Hi,
>
.....
> One question that's not from any particular patch: is there a check
> anywhere to stop the tools creating a PVH domain on an AMD machine?
> I didn't see one but may just have missed it.

The past review comments were to make that part of the upcoming tools
patch that would allow one to create a PVH guest.

thanks
mukesh
Mukesh Rathor
2013-Jul-27 00:59 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On Thu, 25 Jul 2013 17:39:28 +0100
Tim Deegan <tim@xen.org> wrote:

> Hi,
...
> #19 is probably OK for correctness, though I haven't reviewed the VMCS
> settings in enough detail to be sure they're complete. My main
> reservation is that it seems to duplicate a bunch of code from the
> HVM VMCS setup.

Forgot to respond to this. I factored out some common code into
vmx_set_common_host_vmcs_fields. Other lets just leave it, otherwise it
clutters the existing function way too much with if (PVH) statements.

thanks
Mukesh
Mukesh Rathor
2013-Jul-27 01:05 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
On Wed, 24 Jul 2013 18:07:07 -0700
Mukesh Rathor <mukesh.rathor@oracle.com> wrote:

.........
> Thanks Andrew for running the regressions, please lmk.
>
> I've pushed a new branch pvh.v10.acked, with the nit fixed in patch4
> and the acks, at:
>
> git clone -n git://oss.oracle.com/git/mrathor/xen.git .
> git checkout pvh.v10.acked
>
> Look forward to seeing it in xen unstable tree.

Not sure who to direct this to, but I've pushed a new branch
called pvh.v10.acked-1 which is same as pvh.v10.acked except for
Tim's "review by" in all except 9, 10, and 23. There is no code change.
As agreed by him, I'll send incremental patch for that after this is
merged.

thanks
Mukesh
Jan Beulich
2013-Aug-05 11:03 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)

Minimally

int vmx_pvh_set_vcpu_info(struct vcpu *v, const struct vcpu_guest_context *ctxtp)

And you should overall try to constify function parameters wherever
possible.

> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    if ( !(ctxtp->flags & VGCF_in_kernel) )
> +        return -EINVAL;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
> +
> +    __vmwrite(GUEST_FS_BASE, ctxtp->fs_base);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_kernel);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_FS_SELECTOR, ctxtp->user_regs.fs);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> +
> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +    {
> +        vmx_vmcs_exit(v);
> +        return -EINVAL;
> +    }
> +    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_user);
> +
> +    vmx_vmcs_exit(v);
> +    return 0;
> +}

So despite my earlier comments you still neither use nor check the
implied / defaulted to hidden parts of the segment registers vs the
destined descriptor table entries. I'm not going to ack the patch
without this being done at least in an #ifndef NDEBUG code section, or
without a code comment explaining why this cannot be done.

Jan
Jan Beulich
2013-Aug-05 11:08 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 25 Jul 2013 14:47:55 +0100
> Tim Deegan <tim@xen.org> wrote:
>
>> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> ....
>> > + * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called
>> > require the
>> > + *       guest to have same hidden values as the default values loaded in the
>> > + *       vmcs in pvh_construct_vmcs(), ie, the GDT the vcpu is coming up on
>> > + *       should be something like following,
>> > + *       (from 64bit linux, CS:0x10 DS/SS:0x18) :
>> > + *
>> > + *           ffff88007f704000: 0000000000000000 00cf9b000000ffff
>> > + *           ffff88007f704010: 00af9b000000ffff 00cf93000000ffff
>> > + *           ffff88007f704020: 00cffb000000ffff 00cff3000000ffff
>>
>> What a bizarre interface! Why does this operation not load the hidden
>> parts from the GDT/LDT that the caller supplied?
>
> Pl see: http://lists.xen.org/archives/html/xen-devel/2013-07/msg01489.html
>
> That was an option discussed with Jan, walking and reading the GDT
> entries from the gdtaddr the guest provided to load the hidden
> parts. But, I agree with him, that for the initial cpu boot we can
> restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> accessed" default attributes for the hidden part (64bit guest).

That must be a misunderstanding then (also see my other reply) - I
always meant to require that you either properly load the hidden
register portions from the descriptor tables, or at least verify that
the descriptor table entries referenced match the defaults you enforce.

Jan
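The second option mentioned here, verifying that the referenced descriptor table entries match the enforced defaults, could look roughly like the sketch below. It is an illustration only: it assumes the guest's GDT has already been copied into a local array, it uses the two default entries quoted earlier in the thread, and `pvh_desc_is_default()` is a hypothetical helper, not a function from the series. Rejecting the null selector and requiring a strict match on the accessed bit are both assumptions that a real check might need to relax.

```c
#include <stdint.h>

/* Default descriptors quoted in the thread for a 64-bit PVH guest. */
#define PVH_DEFAULT_CS_DESC    0x00af9b000000ffffULL
#define PVH_DEFAULT_DATA_DESC  0x00cf93000000ffffULL

/*
 * Return 1 if selector 'sel' refers to a GDT entry equal to the
 * enforced default, 0 otherwise.
 *   gdt        - local copy of the guest GDT (assumption: already mapped)
 *   nr_entries - number of 8-byte entries in that copy
 *   is_cs      - non-zero when checking the code segment
 */
static int pvh_desc_is_default(const uint64_t *gdt, unsigned nr_entries,
                               uint16_t sel, int is_cs)
{
    unsigned idx = sel >> 3;            /* drop RPL and TI bits */

    if ( (sel & 4) != 0 )               /* TI set: LDT, not handled here */
        return 0;
    if ( idx == 0 || idx >= nr_entries )/* null or out-of-range selector */
        return 0;

    return gdt[idx] == (is_cs ? PVH_DEFAULT_CS_DESC : PVH_DEFAULT_DATA_DESC);
}
```

Applied to the Linux bring-up GDT shown in the quoted comment, CS 0x10 and DS/SS 0x18 pass the check, while selector 0x08 (the 32-bit code entry 00cf9b000000ffff) would be rejected as a code segment.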
Jan Beulich
2013-Aug-05 11:10 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    if ( !(ctxtp->flags & VGCF_in_kernel) )
> +        return -EINVAL;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);

Just noticed: Aren't you mixing up entries and bytes here?

Jan

> +
> +    __vmwrite(GUEST_FS_BASE, ctxtp->fs_base);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_kernel);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_FS_SELECTOR, ctxtp->user_regs.fs);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> +
> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +    {
> +        vmx_vmcs_exit(v);
> +        return -EINVAL;
> +    }
> +    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_user);
> +
> +    vmx_vmcs_exit(v);
> +    return 0;
> +}
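To make the entries-versus-bytes question concrete: a segment limit is the offset of the last valid byte, whereas `ldt_ents` is, as the name suggests, a count of entries. A minimal sketch of the conversion follows, assuming 8-byte entries and at most 8192 of them; `ldt_limit_from_ents()` is a hypothetical helper, not something from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define LDT_ENTRY_BYTES 8u   /* assumed size of one descriptor-table entry */

/* Convert an entry count into a byte limit (offset of the last valid byte). */
static uint32_t ldt_limit_from_ents(uint32_t nr_ents)
{
    /* An empty LDT has no meaningful limit; callers must handle it separately. */
    assert(nr_ents > 0 && nr_ents <= 8192);
    return nr_ents * LDT_ENTRY_BYTES - 1;
}
```

Under these assumptions a one-entry LDT has limit 7, not 1, which is the kind of mismatch the question points at.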
Jan Beulich
2013-Aug-05 11:13 UTC
Re: [V10 PATCH 00/23]PVH xen: Phase I, Version 10 patches...
>>> On 27.07.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 24 Jul 2013 18:07:07 -0700
> Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
> .........
>
>> Thanks Andrew for running the regressions, please lmk.
>>
>> I've pushed a new branch pvh.v10.acked, with the nit fixed in patch4
>> and the acks, at:
>>
>> git clone -n git://oss.oracle.com/git/mrathor/xen.git .
>> git checkout pvh.v10.acked
>>
>> Look forward to seeing it in xen unstable tree.
>
> Not sure who to direct this to, but I've pushed a new branch
> called pvh.v10.acked-1 which is same as pvh.v10.acked except for
> Tim's "review by" in all except 9, 10, and 23. There is no code change.
> As agreed by him, I'll send incremental patch for that after this is
> merged.

Just FYI: If I'm expected to be applying at least the initial part of this series (as far as I'm comfortable with it, i.e. right now perhaps up to patch 8) I will need you to post to the list the versions you want to be applied, even if only for minor edits or tag additions. If you don't do this, I'll leave applying the series to others - at least for the time being I'm not planning to get into "git pull" mode.

Jan
Jan Beulich
2013-Aug-05 11:16 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
>>> On 25.07.13 at 16:01, Tim Deegan <tim@xen.org> wrote:
> At 18:59 -0700 on 23 Jul (1374605958), Mukesh Rathor wrote:
>> @@ -861,7 +865,7 @@ int arch_set_info_guest(
>>
>>      if ( !cr3_page )
>>          rc = -EINVAL;
>> -    else if ( paging_mode_refcounts(d) )
>> +    else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) )
>
> Isn't paging_mode_refcounts() true for all PVH vcpus/domains? (This
> code is OK - I just want to check that I've understood what's going on.)
>
>>          /* nothing */;
>>      else if ( cr3_page == v->arch.old_guest_table )
>>      {
>> @@ -893,8 +897,15 @@ int arch_set_info_guest(
>>          /* handled below */;
>>      else if ( !compat )
>>      {
>> +        /* PVH 32bitfixme. */
>> +        if ( is_pvh_vcpu(v) )
>> +        {
>> +            v->arch.cr3 = page_to_mfn(cr3_page);
>
> Not page_to_maddr()? I guess that with paging_mode_translate(),
> arch.cr3 doesn't actually get used.
>
>> +            v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
>> +        }
>> +
>>          v->arch.guest_table = pagetable_from_page(cr3_page);
>> -        if ( c.nat->ctrlreg[1] )
>> +        if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
>
> If you're respinning this patch, maybe use is_pv_vcpu() here?

And perhaps it would be worth rejecting non-zero CR1 values for PVH?

Jan
Jan Beulich
2013-Aug-05 11:25 UTC
Re: [V10 PATCH 21/23] PVH xen: VMX support of PVH guest creation/destruction
>>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
> +{
> +    vmx_vmcs_enter(v);
> +    switch ( cr )
> +    {
> +    case 3:
> +        __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
> +        hvm_asid_flush_vcpu(v);
> +        break;
> +
> +    default:
> +        printk(XENLOG_ERR
> +               "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
> +               v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));

IMO this should be an ASSERT(), not a printk(), as there should be no way to get here with a value other than 3 if all code is working as expected. Or if this was to remain a printk(), you probably should use XENLOG_G_ERR.

Jan

> +    }
> +    vmx_vmcs_exit(v);
> +}
Jan Beulich
2013-Aug-05 11:37 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
>>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +static int vmxit_msr_read(struct cpu_user_regs *regs)
> +{
> +    u64 msr_content = 0;
> +
> +    switch ( regs->ecx )
> +    {
> +    case MSR_IA32_MISC_ENABLE:
> +        rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> +        msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> +                       MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> +        break;
> +
> +    default:
> +        /* PVH fixme: see hvm_msr_read_intercept(). */
> +        rdmsrl(regs->ecx, msr_content);
> +        break;
> +    }
> +    regs->eax = (uint32_t)msr_content;
> +    regs->edx = (uint32_t)(msr_content >> 32);
> +    vmx_update_guest_eip();
> +
> +    dbgp1("msr read c:%lx a:%lx d:%lx RIP:%lx RSP:%lx\n", regs->ecx, regs->eax,
> +          regs->edx, regs->rip, regs->rsp);
> +
> +    return 0;
> +}
> +
> +/* Returns : 0 == msr written successfully. */
> +static int vmxit_msr_write(struct cpu_user_regs *regs)
> +{
> +    uint64_t msr_content = (uint32_t)regs->eax | ((uint64_t)regs->edx << 32);

... = regs->_eax | (regs->rdx << 32);

would do - the construct you likely copied from elsewhere was only necessary when we still had 32-bit support in the tree.

> +
> +    dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx,
> +          regs->eax, regs->edx);

Please settle on whether you use 0x prefixes or not (a few lines up you don't), and if you do please use %#lx instead of spelling out the 0x (a cleanup patch to that effect went in not too long ago).

> +static int vmxit_exception(struct cpu_user_regs *regs)
> +{
> +    int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK;
> +    int rc = -ENOSYS;
> +
> +    dbgp1(" EXCPT: vec:%d cs:%lx r.IP:%lx\n", vector,
> +          __vmread(GUEST_CS_SELECTOR), regs->eip);

Vectors printed in decimal are almost always useless. And - what's "r.IP"? If you say "cs:", you should likewise say "rip:" or "ip:".

> +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    struct vcpu *vp = current;
> +
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        unsigned long new_cr0 = *regp;
> +        unsigned long old_cr0 = __vmread(GUEST_CR0);
> +
> +        dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);

Here you're mixing 0x and not-0x even within a single message. Such style can only help confusion.

Jan
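Both review points on the MSR write path can be tried out in isolation. The sketch below is plain userspace C rather than hypervisor code: `msr_from_regs()` and `fmt_msr_write()` are hypothetical names, and a 32-bit `eax` plus a 64-bit `rdx` stand in for the `_eax` and `rdx` register fields mentioned in the review. It shows the simplified EDX:EAX combination and the `#` printf flag that emits the 0x prefix.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Combine EDX:EAX into one 64-bit MSR value. With a 32-bit eax and a
 * 64-bit rdx no casts are needed: eax is zero-extended by the usual
 * arithmetic conversions and the shift discards the upper half of rdx.
 */
static uint64_t msr_from_regs(uint32_t eax, uint64_t rdx)
{
    return eax | (rdx << 32);
}

/*
 * Format a debug message with the '#' flag supplying the 0x prefix
 * consistently. Returns a static buffer, so this is not reentrant.
 */
static const char *fmt_msr_write(uint32_t msr, uint64_t val)
{
    static char buf[64];
    int n = snprintf(buf, sizeof(buf),
                     "msr write:%#" PRIx32 " val:%#" PRIx64, msr, val);

    assert(n > 0 && (size_t)n < sizeof(buf));
    return buf;
}
```

One property of the `#` flag worth keeping in mind: a zero value is printed as plain `0`, without the prefix.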
Mukesh Rathor
2013-Aug-06 01:34 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 05 Aug 2013 12:08:50 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 25 Jul 2013 14:47:55 +0100
> > Tim Deegan <tim@xen.org> wrote:
> >
> >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> > ....
> >> > + * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise
> >> > hcall. Called
> >> > require the
> >> > + * guest to have same hidden values as the default values
> >> > loaded in the
> >> > + * vmcs in pvh_construct_vmcs(), ie, the GDT the vcpu is
> >> > coming up on
> >> > + * should be something like following,
> >> > + * (from 64bit linux, CS:0x10 DS/SS:0x18) :
> >> > + *
> >> > + * ffff88007f704000: 0000000000000000
> >> > 00cf9b000000ffff
> >> > + * ffff88007f704010: 00af9b000000ffff
> >> > 00cf93000000ffff
> >> > + * ffff88007f704020: 00cffb000000ffff
> >> > 00cff3000000ffff
> >>
> >> What a bizarre interface! Why does this operation not load the
> >> hidden parts from the GDT/LDT that the caller supplied?
> >
> > Pl see:
> > http://lists.xen.org/archives/html/xen-devel/2013-07/msg01489.html
> >
> > That was an option discussed with Jan, walking and reading the GDT
> > entries from the gdtaddr the guest provided to load the hidden
> > parts. But, I agree with him, that for the initial cpu boot we can
> > restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> > accessed" default attributes for the hidden part (64bit guest).
>
> That must be a misunderstanding then (also see my other reply) - I
> always meant to require that you either properly load the hidden
> register portions from the descriptor tables, or at least verify that
> the descriptor table entries referenced match the defaults you
> enforce.

Ok, I thought you just wanted it to be documented. If I'm gonna write the code to verify, I might as well just write the hidden portions, an option I'd proposed. That way there are no constraints. I'm currently working on just doing that, and it will be in the next version of the patch.

thanks
mukesh
George Dunlap
2013-Aug-06 11:29 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:> This patch introduces the concept of a pvh guest. There are other basic > changes like creating macros to check for pv/pvh vcpu/domain, and also > modifying copy-macros to account for pvh. Finally, guest_kernel_mode is > changed to boast that a PVH doesn''t need to check for TF_kernel_mode > flag since the kernel runs in ring 0. > > Chagnes in V2: > - make is_pvh/is_hvm enum instead of adding is_pvh as a new flag. > - fix indentation and spacing in guest_kernel_mode macro. > - add debug only BUG() in GUEST_KERNEL_RPL macro as it should no longer > be called in any PVH paths. > > Chagnes in V3: > - Rename enum fields, and add is_pv to it. > - Get rid if is_hvm_or_pvh_* macros. > > Chagnes in V4: > - Move e820 fields out of pv_domain struct. > > Chagnes in V5: > - Move e820 changes above in V4, to a separate patch. > > Chagnes in V5: > - Rename enum guest_type from is_pv, ... to guest_type_pv, .... > > Chagnes in V8: > - Got to VMCS for DPL check instead of checking the rpl in > guest_kernel_mode. Note, we drop the const qualifier from > vcpu_show_registers() to accomodate the hvm function call in > guest_kernel_mode(). > - Also, hvm_kernel_mode is put in hvm.c because it''s called from > guest_kernel_mode in regs.h which is a pretty early header include. > Hence, we can''t place it in hvm.h like other similar functions. > The other alternative, to put hvm_kernel_mode in regs.h itself, > but then it calls hvm_get_segment_register() for which either we > need to include hvm.h in regs.h, not possible, or add > proto for hvm_get_segment_register(). But then the args to > hvm_get_segment_register() also need their headers. So, in the > end this seems to be the best/only way. 
> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > xen/arch/x86/debug.c | 2 +- > xen/arch/x86/hvm/hvm.c | 8 ++++++++ > xen/arch/x86/x86_64/traps.c | 2 +- > xen/common/domain.c | 2 +- > xen/include/asm-x86/desc.h | 4 +++- > xen/include/asm-x86/domain.h | 2 +- > xen/include/asm-x86/guest_access.h | 12 ++++++------ > xen/include/asm-x86/x86_64/regs.h | 11 +++++++---- > xen/include/public/domctl.h | 3 +++ > xen/include/xen/sched.h | 21 ++++++++++++++++++--- > 10 files changed, 49 insertions(+), 18 deletions(-) > > diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c > index e67473e..167421d 100644 > --- a/xen/arch/x86/debug.c > +++ b/xen/arch/x86/debug.c > @@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp, > > pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len); > > - mfn = (dp->is_hvm > + mfn = (!is_pv_domain(dp) > ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn) > : dbg_pv_va2mfn(addr, dp, pgd3)); > if ( mfn == INVALID_MFN ) > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 8284b3b..bac4708 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v) > return hvm_funcs.nhvm_intr_blocked(v); > } > > +bool_t hvm_kernel_mode(struct vcpu *v) > +{ > + struct segment_register seg; > + > + hvm_get_segment_register(v, x86_seg_ss, &seg); > + return (seg.attr.fields.dpl == 0); > +} > + > /* > * Local variables: > * mode: C > diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c > index 9e0571d..feb50ff 100644 > --- a/xen/arch/x86/x86_64/traps.c > +++ b/xen/arch/x86/x86_64/traps.c > @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs *regs) > } > } > > -void vcpu_show_registers(const struct vcpu *v) > +void vcpu_show_registers(struct vcpu *v)Rather than doing this (which could potentially mask a bug in which something actually *does* get changed), wouldn''t it make more sense to 
make hvm_kernel_mode (and hvm_get_segment_register) be const? -George
Jan Beulich
2013-Aug-06 11:47 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
>>> On 06.08.13 at 13:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> --- a/xen/arch/x86/x86_64/traps.c
>> +++ b/xen/arch/x86/x86_64/traps.c
>> @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs *regs)
>>      }
>>  }
>>
>> -void vcpu_show_registers(const struct vcpu *v)
>> +void vcpu_show_registers(struct vcpu *v)
>
> Rather than doing this (which could potentially mask a bug in which
> something actually *does* get changed), wouldn't it make more sense
> to make hvm_kernel_mode (and hvm_get_segment_register) be const?

That's what I suggested first too, but which turned out not to work: Down the call tree there is a use of v where a pointer to non-const is required (iirc in VMX specific code).

Jan
George Dunlap
2013-Aug-06 12:06 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On Tue, Aug 6, 2013 at 12:47 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.08.13 at 13:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>>> --- a/xen/arch/x86/x86_64/traps.c
>>> +++ b/xen/arch/x86/x86_64/traps.c
>>> @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs *regs)
>>>      }
>>>  }
>>>
>>> -void vcpu_show_registers(const struct vcpu *v)
>>> +void vcpu_show_registers(struct vcpu *v)
>>
>> Rather than doing this (which could potentially mask a bug in which
>> something actually *does* get changed), wouldn't it make more sense
>> to make hvm_kernel_mode (and hvm_get_segment_register) be const?
>
> That's what I suggested first too, but which turned out not to
> work: Down the call tree there is a use of v where a pointer to
> non-const is required (iirc in VMX specific code).

Then the changelog should say that, preferably the exact function where non-const is needed, so people know why that's necessary without having to do their own looking.

-George
Mukesh Rathor
2013-Aug-06 23:26 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On Tue, 6 Aug 2013 13:06:37 +0100
George Dunlap <George.Dunlap@eu.citrix.com> wrote:

> On Tue, Aug 6, 2013 at 12:47 PM, Jan Beulich <JBeulich@suse.com>
> wrote:
> >>>> On 06.08.13 at 13:29, George Dunlap
> >>>> <George.Dunlap@eu.citrix.com> wrote:
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
> >>> --- a/xen/arch/x86/x86_64/traps.c
> >>> +++ b/xen/arch/x86/x86_64/traps.c
> >>> @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs
> >>> *regs) }
> >>> }
> >>>
> >>> -void vcpu_show_registers(const struct vcpu *v)
> >>> +void vcpu_show_registers(struct vcpu *v)
> >>
> >> Rather than doing this (which could potentially mask a bug in which
> >> something actually *does* get changed), wouldn't it make more
> >> sense to make hvm_kernel_mode (and hvm_get_segment_register) be
> >> const?
> >
> > That's what I suggested first too, but which turned out not to
> > work: Down the call tree there is a use of v where a pointer to
> > non-const is required (iirc in VMX specific code).
>
> Then the changelog should say that, preferably the exact function
> where non-const is needed, so people know why that's necessary without
> having to do their own looking.

And the changelog does say it:

"Note, we drop the const qualifier from vcpu_show_registers() to accomodate the hvm function call in guest_kernel_mode()."

-Mukesh
Mukesh Rathor
2013-Aug-07 00:37 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Fri, 26 Jul 2013 11:45:19 +0100
Tim Deegan <tim@xen.org> wrote:

> At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote:
> > On Thu, 25 Jul 2013 17:28:40 +0100
> > Tim Deegan <tim@xen.org> wrote:
> >
> > > At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote:
> > > > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> > > > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > > > +{
> > > > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> > > > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> > >
> > > Was this discussed before? It seems harsh to stop kernel-mode
> > > code from using the pv cpuid operation if it wants to. In
> > > particular, what about loadable kernel modules?
> >
> > Yes, few times on the xen mailing list. The only PVH guest, linux
> > as of now, the pv ops got rewired to use native cpuid, which is
> > how hvm does it.
>
> Yes, but presumably you want to make it easy for other PV guests to
> port to PVH too?

True, but how would not allowing kernel mode emulation impede that? I fail to understand why a new kernel would wanna use xen signature emulation over just the plain cpuid instruction. I can understand an application would want to be ported unmodified, so we need to support it at least in the short term, but a PVH kernel would need to be modified anyways, so why create more work, right?

> > So, couldn't come up with any real reason to support it.
>
> Seems like there's no reason not to -- wouldn't just
> removing the check for kernel-mode DTRT?

Most likely, but there's a lot of code in that path, and I'd feel more comfortable doing it after testing it. It'll be a small patch in future anyways :). I'll document it near the XEN_CPUID definition for now.

thanks,
mukesh
Mukesh Rathor
2013-Aug-07 01:53 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 5 Aug 2013 18:34:36 -0700
Mukesh Rathor <mukesh.rathor@oracle.com> wrote:

> On Mon, 05 Aug 2013 12:08:50 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
> > >>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com>
> > >>> wrote:
> > > On Thu, 25 Jul 2013 14:47:55 +0100
> > > Tim Deegan <tim@xen.org> wrote:
.......
> >> > >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> > > That was an option discussed with Jan, walking and reading the GDT
> > > entries from the gdtaddr the guest provided to load the hidden
> > > parts. But, I agree with him, that for the initial cpu boot we can
> > > restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> > > accessed" default attributes for the hidden part (64bit guest).
> >
> > That must be a misunderstanding then (also see my other reply) - I
> > always meant to require that you either properly load the hidden
> > register portions from the descriptor tables, or at least verify
> > that the descriptor table entries referenced match the defaults you
> > enforce.
>
> Ok, I thought you just wanted to be documented, If I'm gonna write
> the code to verify, i might as well just write the hidden porions, an
> option I'd proposed. That way there are no constraints. I'm currently
> working on just doing that, and will be in the next version of the
> patch.

Ok, I've mostly got code to set the hidden fields, but the more I think about it, the more I feel that the right/better thing to do is to just not set any selectors at all, instead default them in the VMCS. Meaning, we set CS=0x10, DS/SS = 0x18 in vmcs create code. Then we document that the guest needs to set a boot GDT as:

0000000000000000 00cf9b000000ffff
00af9b000000ffff 00cf93000000ffff

to bootstrap. PV does this anyways, and since we are dealing with PVH here, which is still a modified guest, it seems fairly plausible. The guest just needs to send down the GDT addr and size for the boot vcpu then as part of this ABI. Recall this is just for the initial boot of the vcpu; once it's up, it manages its own GDT.

What do you guys think? please lmk.

thanks
mukesh
Jan Beulich
2013-Aug-07 06:34 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 07.08.13 at 03:53, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 5 Aug 2013 18:34:36 -0700
> Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
>> On Mon, 05 Aug 2013 12:08:50 +0100
>> "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>> > >>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com>
>> > >>> wrote:
>> > > On Thu, 25 Jul 2013 14:47:55 +0100
>> > > Tim Deegan <tim@xen.org> wrote:
> .......
>
> >> > >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
>> > > That was an option discussed with Jan, walking and reading the GDT
>> > > entries from the gdtaddr the guest provided to load the hidden
>> > > parts. But, I agree with him, that for the initial cpu boot we can
>> > > restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
>> > > accessed" default attributes for the hidden part (64bit guest).
>> >
>> > That must be a misunderstanding then (also see my other reply) - I
>> > always meant to require that you either properly load the hidden
>> > register portions from the descriptor tables, or at least verify
>> > that the descriptor table entries referenced match the defaults you
>> > enforce.
>>
>> Ok, I thought you just wanted to be documented, If I'm gonna write
>> the code to verify, i might as well just write the hidden porions, an
>> option I'd proposed. That way there are no constraints. I'm currently
>> working on just doing that, and will be in the next version of the
>> patch.
>
> Ok, I've mostly got code to set the hidden fields, but the more I think
> about it, the more I feel that the right/better thing to do is to
> just not set any selectors at all, instead default them in the VMCS.
> Meaning, we set CS=0x10, DS/SS = 0x18 in vmcs create code. Then we
> document that the guest needs to set a boot GDT as:
>
> 0000000000000000 00cf9b000000ffff
> 00af9b000000ffff 00cf93000000ffff
>
> to bootstrap. PV does this anyways, and since we are dealing with PVH
> here, which is still modified guest, it seems fairly plausible. The
> guest just needs to send down the GDT addr and size for the boot vcpu
> then as part of this ABI. Recall this is just for the intial boot of the
> vcpu, once its up, it manages its own GDT.
>
> What do you guys think? please lmk.

I don't mind doing it that way, as long as (as said before) the code properly verifies these constraints at least in debug mode. However, if you want to go with a "minimal" concept, then let's please require a truly minimal GDT: For a 64-bit guest that'd be CS = 0x0008, all other selector registers zero. All you need to validate then is a single descriptor, and the guest could even choose to point the initial GDT at the "normal" GDT's 64-bit code descriptor minus 8.

Jan
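The "minus 8" suggestion is simple address arithmetic, sketched below under stated assumptions. The names (`selector_offset()`, `boot_gdt_offset()`, `boot_gdt_limit()`, `lookup_desc()`, `normal_gdt`) are hypothetical, offsets into an array stand in for guest addresses, and the table contents are the four descriptors quoted in the proposal above, in which the 64-bit code descriptor sits at byte offset 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_BYTES   8u                     /* size of one GDT entry */
#define BOOT_CS      0x0008u                /* the single selector the minimal ABI would use */
#define FLAT_CODE64  0x00af9b000000ffffULL  /* flat 64-bit code descriptor */

/* The guest's ordinary GDT, as quoted in the thread. */
static const uint64_t normal_gdt[] = {
    0x0000000000000000ULL, 0x00cf9b000000ffffULL,
    0x00af9b000000ffffULL, 0x00cf93000000ffffULL,
};

/* Byte offset into a descriptor table named by a selector (RPL and TI bits dropped). */
static uint32_t selector_offset(uint16_t sel)
{
    return sel & ~7u;
}

/* Where to point the boot GDT so that BOOT_CS lands on the 64-bit code descriptor. */
static size_t boot_gdt_offset(size_t code64_offset)
{
    assert(code64_offset >= DESC_BYTES && code64_offset % DESC_BYTES == 0);
    return code64_offset - selector_offset(BOOT_CS);
}

/* Smallest limit that still covers the one descriptor BOOT_CS refers to. */
static uint32_t boot_gdt_limit(void)
{
    return selector_offset(BOOT_CS) + DESC_BYTES - 1;
}

/* Descriptor lookup relative to a table base given as a byte offset. */
static uint64_t lookup_desc(const uint64_t *table, size_t base_offset,
                            uint16_t sel)
{
    return table[(base_offset + selector_offset(sel)) / DESC_BYTES];
}
```

With the code descriptor at offset 16, the boot GDT base would be offset 8 and the limit 15, so selector 0x0008 resolves to the 64-bit code descriptor and nothing else needs validating.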
Jan Beulich
2013-Aug-07 07:14 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
>>> On 07.08.13 at 01:26, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 6 Aug 2013 13:06:37 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Tue, Aug 6, 2013 at 12:47 PM, Jan Beulich <JBeulich@suse.com>
>> wrote:
>> >>>> On 06.08.13 at 13:29, George Dunlap
>> >>>> <George.Dunlap@eu.citrix.com> wrote:
>> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> >> <mukesh.rathor@oracle.com> wrote:
>> >>> --- a/xen/arch/x86/x86_64/traps.c
>> >>> +++ b/xen/arch/x86/x86_64/traps.c
>> >>> @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs
>> >>> *regs) }
>> >>> }
>> >>>
>> >>> -void vcpu_show_registers(const struct vcpu *v)
>> >>> +void vcpu_show_registers(struct vcpu *v)
>> >>
>> >> Rather than doing this (which could potentially mask a bug in which
>> >> something actually *does* get changed), wouldn't it make more
>> >> sense to make hvm_kernel_mode (and hvm_get_segment_register) be
>> >> const?
>> >
>> > That's what I suggested first too, but which turned out not to
>> > work: Down the call tree there is a use of v where a pointer to
>> > non-const is required (iirc in VMX specific code).
>>
>> Then the changelog should say that, preferably the exact function
>> where non-const is needed, so people know why that's necessary without
>> having to do their own looking.
>
> And the changelog does say it:
>
> "Note, we drop the const qualifier from vcpu_show_registers() to
> accomodate the hvm function call in guest_kernel_mode()."

But I think George really was after you pointing out where down the call tree the real need for this arises (i.e. why that call tree can't instead have const-s added).

Jan
George Dunlap
2013-Aug-07 09:14 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On 07/08/13 00:26, Mukesh Rathor wrote:
> On Tue, 6 Aug 2013 13:06:37 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Tue, Aug 6, 2013 at 12:47 PM, Jan Beulich <JBeulich@suse.com>
>> wrote:
>>>>>> On 06.08.13 at 13:29, George Dunlap
>>>>>> <George.Dunlap@eu.citrix.com> wrote:
>>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>>>> <mukesh.rathor@oracle.com> wrote:
>>>>> --- a/xen/arch/x86/x86_64/traps.c
>>>>> +++ b/xen/arch/x86/x86_64/traps.c
>>>>> @@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs
>>>>> *regs) }
>>>>> }
>>>>>
>>>>> -void vcpu_show_registers(const struct vcpu *v)
>>>>> +void vcpu_show_registers(struct vcpu *v)
>>>> Rather than doing this (which could potentially mask a bug in which
>>>> something actually *does* get changed), wouldn't it make more
>>>> sense to make hvm_kernel_mode (and hvm_get_segment_register) be
>>>> const?
>>> That's what I suggested first too, but which turned out not to
>>> work: Down the call tree there is a use of v where a pointer to
>>> non-const is required (iirc in VMX specific code).
>> Then the changelog should say that, preferably the exact function
>> where non-const is needed, so people know why that's necessary without
>> having to do their own looking.
> And the changelog does say it:
>
> "Note, we drop the const qualifier from vcpu_show_registers() to
> accomodate the hvm function call in guest_kernel_mode()."

I said *exact function*. guest_kernel_mode() doesn't need it non-const; it needs it because of a function that it calls. That in turn doesn't need it non-const either -- it needs it because of the next one down. Who *actually* needs vcpu to be non-const, way down at the bottom? That's what I need to know to understand why we can't just change each of those functions to const all the way down.

-George
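The general mechanism behind this exchange can be shown with a toy call chain. Everything below is hypothetical: the thread does not name the function that needs the non-const pointer, so `struct vcpu_sketch`, `vmcs_enter()`, `vmcs_exit()` and the nesting counter are invented purely to illustrate how one writer at the bottom of a call tree forces every caller above it to give up `const`.

```c
#include <assert.h>

/* Invented stand-in for a vcpu; not the Xen structure. */
struct vcpu_sketch {
    int vmcs_nesting;      /* bookkeeping touched even by "read-only" accessors */
    unsigned int ss_dpl;   /* privilege level cached for the stack segment */
};

/* The bottom of the call tree: these genuinely modify the object. */
static void vmcs_enter(struct vcpu_sketch *v)
{
    v->vmcs_nesting++;
}

static void vmcs_exit(struct vcpu_sketch *v)
{
    assert(v->vmcs_nesting > 0);
    v->vmcs_nesting--;
}

/*
 * Logically a getter, but it cannot take 'const struct vcpu_sketch *'
 * because it has to pass v on to vmcs_enter()/vmcs_exit().
 */
static unsigned int get_ss_dpl(struct vcpu_sketch *v)
{
    unsigned int dpl;

    vmcs_enter(v);
    dpl = v->ss_dpl;
    vmcs_exit(v);
    return dpl;
}

/* The loss of const propagates upward to every caller. */
static int kernel_mode(struct vcpu_sketch *v)
{
    return get_ss_dpl(v) == 0;
}

/* Small driver: builds a vcpu with the given DPL and queries it. */
static int demo_kernel_mode(unsigned int dpl)
{
    struct vcpu_sketch v = { 0, dpl };
    int ret = kernel_mode(&v);

    assert(v.vmcs_nesting == 0);   /* enter/exit are balanced */
    return ret;
}
```

In such a chain, naming the lowest function (here `vmcs_enter()`) in the changelog is what tells a reader why adding `const` higher up would not compile without a cast.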
Tim Deegan
2013-Aug-07 09:54 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
At 17:37 -0700 on 06 Aug (1375810631), Mukesh Rathor wrote:
> On Fri, 26 Jul 2013 11:45:19 +0100
> Tim Deegan <tim@xen.org> wrote:
>
> > At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote:
> > > On Thu, 25 Jul 2013 17:28:40 +0100
> > > Tim Deegan <tim@xen.org> wrote:
> > >
> > > > At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote:
> > > > > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> > > > > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > > > > +{
> > > > > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> > > > > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> > > >
> > > > Was this discussed before? It seems harsh to stop kernel-mode
> > > > code from using the pv cpuid operation if it wants to. In
> > > > particular, what about loadable kernel modules?
> > >
> > > Yes, few times on the xen mailing list. The only PVH guest, linux
> > > as of now, the pv ops got rewired to use native cpuid, which is
> > > how hvm does it.
> >
> > Yes, but presumably you want to make it easy for other PV guests to
> > port to PVH too?
>
> True, but how would not allowing kernel mode emulation impede that?
> I fail to understand why a new kernel would wanna use xen signature
> emulation over just plain cpuid instruction?

I'm talking about existing PV kernel code that already uses PV CPUID. And in particular what if that kernel code is in a device driver, or even a third-party loadable module? Porting the core kernel from PV to PVH shouldn't break that code if it doesn't have to.

But TBH my objection is really more aesthetic than anything else. Restricting the PV CPUID instruction here adds another ragged edge in the ABI that all kernel writers have to think about, and for little or no benefit.

And, to be clear, I object to it and this patch does not have my Ack.

Tim.
Tim Deegan
2013-Aug-07 10:01 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 18:53 -0700 on 06 Aug (1375815193), Mukesh Rathor wrote:
> On Mon, 5 Aug 2013 18:34:36 -0700
> Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
> > On Mon, 05 Aug 2013 12:08:50 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> > > >>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com>
> > > >>> wrote:
> > > > On Thu, 25 Jul 2013 14:47:55 +0100
> > > > Tim Deegan <tim@xen.org> wrote:
> .......
>
> > > > > >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> > > > That was an option discussed with Jan, walking and reading the GDT
> > > > entries from the gdtaddr the guest provided to load the hidden
> > > > parts. But, I agree with him, that for the initial cpu boot we can
> > > > restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> > > > accessed" default attributes for the hidden part (64bit guest).
> > >
> > > That must be a misunderstanding then (also see my other reply) - I
> > > always meant to require that you either properly load the hidden
> > > register portions from the descriptor tables, or at least verify
> > > that the descriptor table entries referenced match the defaults you
> > > enforce.
> >
> > Ok, I thought you just wanted to be documented, If I'm gonna write
> > the code to verify, i might as well just write the hidden porions, an
> > option I'd proposed. That way there are no constraints. I'm currently
> > working on just doing that, and will be in the next version of the
> > patch.
>
> Ok, I've mostly got code to set the hidden fields, but the more I think
> about it, the more I feel that the right/better thing to do is to
> just not set any selectors at all, instead default them in the VMCS.

Why do you think that's better? Why on earth wouldn't a VCPU context-setting hypercall that takes segment selectors as arguments just DTRT? The principle of least astonishment should apply here.

> Meaning, we set CS=0x10, DS/SS = 0x18 in vmcs create code. Then we
> document that the guest needs to set a boot GDT as:
>
> 0000000000000000 00cf9b000000ffff
> 00af9b000000ffff 00cf93000000ffff

That's a pretty ugly interface -- just because it happens to be easy to implement in linux doesn't mean it's a good idea. What about other OSes (or linux, if it changes its default GDT settings)? They should carry extra code that makes up a GDT just for this one hypercall and then more code to switch away from it after boot?

Tim.
Ian Campbell
2013-Aug-07 10:07 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Wed, 2013-08-07 at 11:01 +0100, Tim Deegan wrote:
> At 18:53 -0700 on 06 Aug (1375815193), Mukesh Rathor wrote:
> > On Mon, 5 Aug 2013 18:34:36 -0700
> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >
> > > On Mon, 05 Aug 2013 12:08:50 +0100
> > > "Jan Beulich" <JBeulich@suse.com> wrote:
> > >
> > > > >>> On 26.07.13 at 02:58, Mukesh Rathor <mukesh.rathor@oracle.com>
> > > > >>> wrote:
> > > > > On Thu, 25 Jul 2013 14:47:55 +0100
> > > > > Tim Deegan <tim@xen.org> wrote:
> > .......
> > > > > > >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote:
> > > > > That was an option discussed with Jan, walking and reading the GDT
> > > > > entries from the gdtaddr the guest provided to load the hidden
> > > > > parts. But, I agree with him, that for the initial cpu boot we can
> > > > > restrict the ABI to say: 0 base addr, ~0 limit, and "read/write,
> > > > > accessed" default attributes for the hidden part (64bit guest).
> > > >
> > > > That must be a misunderstanding then (also see my other reply) - I
> > > > always meant to require that you either properly load the hidden
> > > > register portions from the descriptor tables, or at least verify
> > > > that the descriptor table entries referenced match the defaults you
> > > > enforce.
> > >
> > > Ok, I thought you just wanted to be documented, If I'm gonna write
> > > the code to verify, i might as well just write the hidden porions, an
> > > option I'd proposed. That way there are no constraints. I'm currently
> > > working on just doing that, and will be in the next version of the
> > > patch.
> >
> > Ok, I've mostly got code to set the hidden fields, but the more I think
> > about it, the more I feel that the right/better thing to do is to
> > just not set any selectors at all, instead default them in the VMCS.
>
> Why do you think that's better? Why on earth wouldn't a VCPU
> context-setting hypercall that takes segment selectors as arguments just
> DTRT? The principle of least astonishment should apply here.
>
> > Meaning, we set CS=0x10, DS/SS = 0x18 in vmcs create code. Then we
> > document that the guest needs to set a boot GDT as:
> >
> > 0000000000000000 00cf9b000000ffff
> > 00af9b000000ffff 00cf93000000ffff
>
> That's a pretty ugly interface -- just because it happens to be easy to
> implement in linux doesn't mean it's a good idea. What about other OSes
> (or linux, if it changes its default GDT settings)? They should carry
> extra code that makes up a GDT just for this one hypercall and then more
> code to switch away from it after boot?

I'm not sure that it applies here but a proper PV guest is entitled to expect that certain hardcoded selectors be present in the GDT, in the last couple of pages which are reserved for Xen but contain convenient selectors. A PV guest is even launched on those IIRC, but perhaps not for secondary vcpus, unless the guest chooses to.

Ian.
George Dunlap
2013-Aug-07 10:24 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch mostly contains changes to arch/x86/domain.c to allow for a PVH
> domain creation. The new function pvh_set_vcpu_info(), introduced in the
> previous patch, is called here to set some guest context in the VMCS.
> This patch also changes the context_switch code in the same file to follow
> HVM behaviour for PVH.
>
> Changes in V2:
>   - changes to read_segment_register() moved to this patch.
>   - The other comment was to create NULL functions for pvh_set_vcpu_info
>     and pvh_read_descriptor which are implemented in later patch, but since
>     I disable PVH creation until all patches are checked in, it is not needed.
>     But it helps breaking down of patches.
>
> Changes in V3:
>   - Fix read_segment_register() macro to make sure args are evaluated once,
>     and use # instead of STR for name in the macro.
>
> Changes in V4:
>   - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been
>     moved out of pv_vcpu struct.
>   - rename hvm_pvh_* functions to hvm_*.
>
> Changes in V5:
>   - remove pvh_read_descriptor().
>
> Changes in V7:
>   - remove hap_update_cr3() and read_segment_register changes from here.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
>  xen/arch/x86/domain.c |   56 ++++++++++++++++++++++++++++++++----------------
>  xen/arch/x86/mm.c     |    3 ++
>  2 files changed, 40 insertions(+), 19 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index c361abf..fccb4ee 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)
>
>      vmce_init_vcpu(v);
>
> -    if ( is_hvm_domain(d) )
> +    if ( !is_pv_domain(d) )
>      {
>          rc = hvm_vcpu_initialise(v);
>          goto done;
> @@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)
>
>      vcpu_destroy_fpu(v);
>
> -    if ( is_hvm_vcpu(v) )
> +    if ( !is_pv_vcpu(v) )
>          hvm_vcpu_destroy(v);
>      else
>          xfree(v->arch.pv_vcpu.trap_ctxt);
> @@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      int rc = -ENOMEM;
>
>      d->arch.hvm_domain.hap_enabled =
> -        is_hvm_domain(d) &&
> +        !is_pv_domain(d) &&
>          hvm_funcs.hap_supported &&
>          (domcr_flags & DOMCRF_hap);
>      d->arch.hvm_domain.mem_sharing_enabled = 0;
> @@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      mapcache_domain_init(d);
>
>      HYPERVISOR_COMPAT_VIRT_START(d) =
> -        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
> +        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
>
>      if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
>          goto fail;
> @@ -555,7 +555,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      }
>      spin_lock_init(&d->arch.e820_lock);
>
> -    if ( is_hvm_domain(d) )
> +    if ( !is_pv_domain(d) )
>      {
>          if ( (rc = hvm_domain_initialise(d)) != 0 )
>          {
> @@ -651,7 +651,7 @@ int arch_set_info_guest(
>  #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
>      flags = c(flags);
>
> -    if ( !is_hvm_vcpu(v) )
> +    if ( is_pv_vcpu(v) )
>      {
>          if ( !compat )
>          {
> @@ -704,7 +704,7 @@ int arch_set_info_guest(
>      v->fpu_initialised = !!(flags & VGCF_I387_VALID);
>
>      v->arch.flags &= ~TF_kernel_mode;
> -    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
> +    if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ )
>          v->arch.flags |= TF_kernel_mode;
>
>      v->arch.vgc_flags = flags;
> @@ -719,7 +719,7 @@ int arch_set_info_guest(
>      if ( !compat )
>      {
>          memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
> -        if ( !is_hvm_vcpu(v) )
> +        if ( is_pv_vcpu(v) )
>              memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
>                     sizeof(c.nat->trap_ctxt));
>      }
> @@ -735,10 +735,13 @@ int arch_set_info_guest(
>
>      v->arch.user_regs.eflags |= 2;
>
> -    if ( is_hvm_vcpu(v) )
> +    if ( !is_pv_vcpu(v) )
>      {
>          hvm_set_info_guest(v);
> -        goto out;
> +        if ( is_hvm_vcpu(v) || v->is_initialised )
> +            goto out;
> +        else
> +            goto pvh_skip_pv_stuff;
>      }
>
>      init_int80_direct_trap(v);
> @@ -853,6 +856,7 @@ int arch_set_info_guest(
>
>      set_bit(_VPF_in_reset, &v->pause_flags);
>
> + pvh_skip_pv_stuff:
>      if ( !compat )
>          cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
>      else
> @@ -861,7 +865,7 @@ int arch_set_info_guest(
>
>      if ( !cr3_page )
>          rc = -EINVAL;
> -    else if ( paging_mode_refcounts(d) )
> +    else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) )
>          /* nothing */;
>      else if ( cr3_page == v->arch.old_guest_table )
>      {
> @@ -893,8 +897,15 @@ int arch_set_info_guest(
>          /* handled below */;
>      else if ( !compat )
>      {
> +        /* PVH 32bitfixme. */
> +        if ( is_pvh_vcpu(v) )
> +        {
> +            v->arch.cr3 = page_to_mfn(cr3_page);
> +            v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
> +        }
> +
>          v->arch.guest_table = pagetable_from_page(cr3_page);
> -        if ( c.nat->ctrlreg[1] )
> +        if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
>          {
>              cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
>              cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
> @@ -954,6 +965,13 @@ int arch_set_info_guest(
>
>      update_cr3(v);
>
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        /* Set VMCS fields. */
> +        if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 )
> +            return rc;
> +    }
> +
>   out:
>      if ( flags & VGCF_online )
>          clear_bit(_VPF_down, &v->pause_flags);
> @@ -1315,7 +1333,7 @@ static void update_runstate_area(struct vcpu *v)
>
>  static inline int need_full_gdt(struct vcpu *v)
>  {
> -    return (!is_hvm_vcpu(v) && !is_idle_vcpu(v));
> +    return (is_pv_vcpu(v) && !is_idle_vcpu(v));
>  }
>
>  static void __context_switch(void)
> @@ -1450,7 +1468,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>      /* Re-enable interrupts before restoring state which may fault. */
>      local_irq_enable();
>
> -    if ( !is_hvm_vcpu(next) )
> +    if ( is_pv_vcpu(next) )
>      {
>          load_LDT(next);
>          load_segments(next);
> @@ -1576,12 +1594,12 @@ unsigned long hypercall_create_continuation(
>          regs->eax = op;
>
>          /* Ensure the hypercall trap instruction is re-executed. */
> -        if ( !is_hvm_vcpu(current) )
> +        if ( is_pv_vcpu(current) )
>              regs->eip -= 2;  /* re-execute 'syscall' / 'int $xx' */
>          else
>              current->arch.hvm_vcpu.hcall_preempted = 1;
>
> -        if ( !is_hvm_vcpu(current) ?
> +        if ( is_pv_vcpu(current) ?
>               !is_pv_32on64_vcpu(current) :
>               (hvm_guest_x86_mode(current) == 8) )
>          {
> @@ -1849,7 +1867,7 @@ int domain_relinquish_resources(struct domain *d)
>              return ret;
>      }
>
> -    if ( !is_hvm_domain(d) )
> +    if ( is_pv_domain(d) )
>      {
>          for_each_vcpu ( d, v )
>          {
> @@ -1922,7 +1940,7 @@ int domain_relinquish_resources(struct domain *d)
>          BUG();
>      }
>
> -    if ( is_hvm_domain(d) )
> +    if ( !is_pv_domain(d) )
>          hvm_domain_relinquish_resources(d);
>
>      return 0;
> @@ -2006,7 +2024,7 @@ void vcpu_mark_events_pending(struct vcpu *v)
>      if ( already_pending )
>          return;
>
> -    if ( is_hvm_vcpu(v) )
> +    if ( !is_pv_vcpu(v) )
>          hvm_assert_evtchn_irq(v);
>      else
>          vcpu_kick(v);
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 412971e..ece11e4 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v)
>      int i;
>      unsigned long pfn;
>
> +    if ( is_pvh_vcpu(v) )
> +        return;

There seems to be some inconsistency with where this is supposed to be checked -- in domain_relinquish_resources(), destroy_gdt() is only called for pv domains (gated on is_pv_domain); but in arch_set_info_guest(), it *was* gated on being PV, but with the PVH changes it's still being called.

Either this should only be called for PV domains (and this check should be an ASSERT), or it should be called regardless of the type of domain. I prefer the first if possible.

(FYI I'm giving comments as I make my way through the series, but I'm holding off on ack / reviewed-by until I've grokked the whole thing.)

-George
George Dunlap
2013-Aug-07 10:48 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
[...]
> @@ -954,6 +965,13 @@ int arch_set_info_guest(
>
>      update_cr3(v);
>
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        /* Set VMCS fields. */
> +        if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 )
> +            return rc;
> +    }

BTW, would it make sense to pull the code above, which sets FS and GS for PV guests into a separate function, and then do something like the following?

is_pv_vcpu(v) ? pv_set_vcpu_info() : pvh_set_vcpu_info();

-George
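The refactoring suggested above could look roughly like the following. This is only a sketch under stated assumptions: the struct layouts are trimmed stand-ins, pv_set_vcpu_info() does not exist in the series (it is the function George proposes), and the real PV path does considerably more than store one base address.

```c
#include <assert.h>
#include <stddef.h>

enum guest_type { GUEST_PV, GUEST_PVH };          /* hypothetical */

struct guest_ctxt { unsigned long fs_base; };     /* trimmed stand-in */

struct vcpu {
    enum guest_type type;
    unsigned long pv_fs_base;     /* stand-in for PV per-vcpu state */
    unsigned long vmcs_fs_base;   /* stand-in for a VMCS guest field */
};

static int is_pv_vcpu(const struct vcpu *v)
{
    return v->type == GUEST_PV;
}

/* PV half: would hold the FS/GS setup currently inlined in the caller. */
static int pv_set_vcpu_info(struct vcpu *v, const struct guest_ctxt *c)
{
    v->pv_fs_base = c->fs_base;
    return 0;
}

/* PVH half: same signature, but the state ends up in the VMCS. */
static int pvh_set_vcpu_info(struct vcpu *v, const struct guest_ctxt *c)
{
    v->vmcs_fs_base = c->fs_base;
    return 0;
}

/* Single call site: the guest-type decision is made exactly once. */
static int set_vcpu_info(struct vcpu *v, const struct guest_ctxt *c)
{
    assert(v != NULL && c != NULL);
    return is_pv_vcpu(v) ? pv_set_vcpu_info(v, c) : pvh_set_vcpu_info(v, c);
}
```

Giving both halves the same signature is what makes the one-line dispatch possible; it also keeps arch_set_info_guest() free of a PVH-only block in the middle of PV code.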
George Dunlap
2013-Aug-07 11:29 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch supports invalid op emulation for PVH by calling appropriate
> copy macros and and HVM function to inject PF.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
>  xen/arch/x86/traps.c        |   17 ++++++++++++++---
>  xen/include/asm-x86/traps.h |    1 +
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 378ef0a..a3ca70b 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -459,6 +459,11 @@ static void instruction_done(
>      struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
>  {
>      regs->eip = eip;
> +
> +    /* PVH fixme: debug trap below */
> +    if ( is_pvh_vcpu(current) )
> +        return;

What exactly does this comment mean? Do you mean, "FIXME: Make debug trapping work for PVH guests"? (i.e., this functionality will be implemented later?)

> +
>      regs->eflags &= ~X86_EFLAGS_RF;
>      if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) )
>      {
> @@ -913,7 +918,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
>      return EXCRET_fault_fixed;
>  }
>
> -static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
> +int emulate_forced_invalid_op(struct cpu_user_regs *regs)

Why make this non-static? No one is using this in this patch. If a later patch needs it, you should make it non-static there, so we can decide at that point if making it non-static is merited or not.

>  {
>      char sig[5], instr[2];
>      unsigned long eip, rc;
> @@ -921,7 +926,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
>      eip = regs->eip;
>
>      /* Check for forced emulation signature: ud2 ; .ascii "xen". */
> -    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
> +    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
>      {
>          propagate_page_fault(eip + sizeof(sig) - rc, 0);
>          return EXCRET_fault_fixed;
> @@ -931,7 +936,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
>      eip += sizeof(sig);
>
>      /* We only emulate CPUID. */
> -    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
> +    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
>      {
>          propagate_page_fault(eip + sizeof(instr) - rc, 0);
>          return EXCRET_fault_fixed;
> @@ -1076,6 +1081,12 @@ void propagate_page_fault(unsigned long addr, u16 error_code)
>      struct vcpu *v = current;
>      struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce;
>
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        hvm_inject_page_fault(error_code, addr);
> +        return;
> +    }

Would it make more sense to rename this function "pv_inject_page_fault", and then make a macro to switch between the two?

> +
>      v->arch.pv_vcpu.ctrlreg[2] = addr;
>      arch_set_cr2(v, addr);
>
> diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h
> index 82cbcee..1d9b087 100644
> --- a/xen/include/asm-x86/traps.h
> +++ b/xen/include/asm-x86/traps.h
> @@ -48,5 +48,6 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid,
>   */
>  extern int send_guest_trap(struct domain *d, uint16_t vcpuid,
>                             unsigned int trap_nr);
> +int emulate_forced_invalid_op(struct cpu_user_regs *regs);

Same as above re making the function non-static.

-George
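As background for the forced-emulation check quoted above ("ud2 ; .ascii "xen"" followed by CPUID), the byte pattern can be illustrated with a small matcher. This is a sketch, not the hypervisor code: match_forced_cpuid() is a hypothetical helper that works on a buffer already in hand, whereas the real function copies the bytes from guest memory in two steps and injects a page fault if a copy fails.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ud2 is encoded as 0f 0b; it is followed by the ASCII bytes "xen". */
static const unsigned char fep_sig[5] = { 0x0f, 0x0b, 'x', 'e', 'n' };
/* cpuid is encoded as 0f a2. */
static const unsigned char cpuid_insn[2] = { 0x0f, 0xa2 };

/*
 * Returns the number of instruction bytes to skip (7) when `code`
 * starts with the forced-emulation signature followed by CPUID,
 * or 0 when the pattern does not match.
 */
static size_t match_forced_cpuid(const unsigned char *code, size_t len)
{
    assert(code != NULL);
    if ( len < sizeof(fep_sig) + sizeof(cpuid_insn) )
        return 0;
    if ( memcmp(code, fep_sig, sizeof(fep_sig)) != 0 )
        return 0;
    if ( memcmp(code + sizeof(fep_sig), cpuid_insn, sizeof(cpuid_insn)) != 0 )
        return 0;
    return sizeof(fep_sig) + sizeof(cpuid_insn);
}
```

A plain CPUID without the prefix does not match, which is the point of the signature: the guest opts in to emulation explicitly.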
George Dunlap
2013-Aug-07 13:10 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On Wed, Aug 7, 2013 at 10:14 AM, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> And the changelog does say it:
>>
>> "Note, we drop the const qualifier from vcpu_show_registers() to
>> accomodate the hvm function call in guest_kernel_mode()."
>
>
> I said *exact function*. guest_kernel_mode() doesn't need it non-const; it
> needs it because of a function that it calls. That in turn doesn't need it
> non-const either -- it needs it because of the next one down. Who
> *actually* needs vcpu to be non-const, way down at the bottom? That's what
> I need to know to understand why we can't just change each of those
> functions to const all the way down.

The general principle here is that you have already done the work of tracing through the code to figure out what's going on; you should cache that information in the commit log so that reviewers (and people doing archaeology) don't need to duplicate the effort.

-George
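The "const all the way down" point can be made concrete with a toy call chain. Everything here is hypothetical (the struct, the function names, and the assumption that the bottom function only reads the vcpu); the thread does not say which function at the bottom actually needs a non-const pointer, which is exactly the information being asked for.

```c
#include <assert.h>
#include <stddef.h>

struct vcpu { int in_kernel; };           /* hypothetical, trimmed */

/* Bottom of the chain: only reads the vcpu, so const is sufficient. */
static int leaf_query(const struct vcpu *v)
{
    assert(v != NULL);
    return v->in_kernel;
}

/* Middle: may take const only because leaf_query() accepts const. */
static int guest_kernel_mode_sketch(const struct vcpu *v)
{
    return leaf_query(v);
}

/* Top: keeps its const qualifier as long as every callee below does. */
static int show_registers_sketch(const struct vcpu *v)
{
    return guest_kernel_mode_sketch(v);
}
```

If leaf_query() took a plain `struct vcpu *`, each caller above it would have to drop const as well (or cast it away), which is the cascade described in the message.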
George Dunlap
2013-Aug-07 13:49 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch changes mostly traps.c to support privileged op emulation for PVH.
> A new function read_descriptor_sel() is introduced to read descriptor for PVH
> given a selector. Another new function vmx_read_selector() reads a selector
> from VMCS, to support read_segment_register() for PVH.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c    |   40 +++++++++++++++++++
>  xen/arch/x86/traps.c          |   86 +++++++++++++++++++++++++++++++++++-----
>  xen/include/asm-x86/hvm/hvm.h |    7 +++
>  xen/include/asm-x86/system.h  |   19 +++++++--
>  4 files changed, 137 insertions(+), 15 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index e3c7515..80109c1 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -664,6 +664,45 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
>      .fields = { .type = 0xb, .s = 0, .dpl = 0, .p = 1, .avl = 0, \
>                  .l = 0, .db = 0, .g = 0, .pad = 0 } }).bytes)
>
> +u16 vmx_read_selector(struct vcpu *v, enum x86_segment seg)
> +{
> +    u16 sel = 0;
> +
> +    vmx_vmcs_enter(v);
> +    switch ( seg )
> +    {
> +    case x86_seg_cs:
> +        sel = __vmread(GUEST_CS_SELECTOR);
> +        break;
> +
> +    case x86_seg_ss:
> +        sel = __vmread(GUEST_SS_SELECTOR);
> +        break;
> +
> +    case x86_seg_es:
> +        sel = __vmread(GUEST_ES_SELECTOR);
> +        break;
> +
> +    case x86_seg_ds:
> +        sel = __vmread(GUEST_DS_SELECTOR);
> +        break;
> +
> +    case x86_seg_fs:
> +        sel = __vmread(GUEST_FS_SELECTOR);
> +        break;
> +
> +    case x86_seg_gs:
> +        sel = __vmread(GUEST_GS_SELECTOR);
> +        break;
> +
> +    default:
> +        BUG();
> +    }
> +    vmx_vmcs_exit(v);
> +
> +    return sel;
> +}
> +
>  void vmx_get_segment_register(struct vcpu *v, enum x86_segment seg,
>                                struct segment_register *reg)
>  {
> @@ -1563,6 +1602,7 @@ static struct hvm_function_table __initdata vmx_function_table = {
>      .handle_eoi           = vmx_handle_eoi,
>      .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m,
>      .pvh_set_vcpu_info    = vmx_pvh_set_vcpu_info,
> +    .read_selector        = vmx_read_selector,
>  };
>
>  const struct hvm_function_table * __init start_vmx(void)
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index a3ca70b..fe8b94c 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -480,6 +480,10 @@ static unsigned int check_guest_io_breakpoint(struct vcpu *v,
>      unsigned int width, i, match = 0;
>      unsigned long start;
>
> +    /* PVH fixme: support io breakpoint. */
> +    if ( is_pvh_vcpu(v) )
> +        return 0;

Does this one, and the check to IO below, have anything to do with privileged op emulation?

> +
>      if ( !(v->arch.debugreg[5]) ||
>           !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) )
>          return 0;
> @@ -1525,6 +1529,49 @@ static int read_descriptor(unsigned int sel,
>      return 1;
>  }
>
> +static int read_descriptor_sel(unsigned int sel,
> +                               enum x86_segment which_sel,
> +                               struct vcpu *v,
> +                               const struct cpu_user_regs *regs,
> +                               unsigned long *base,
> +                               unsigned long *limit,
> +                               unsigned int *ar,
> +                               unsigned int vm86attr)
> +{
> +    struct segment_register seg;
> +    bool_t long_mode;
> +
> +    if ( !is_pvh_vcpu(v) )
> +        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);

Again, wouldn't it be better to rename read_descriptor to pv_read_descriptor(), name this one pvh_read_descriptor(), give them a similar function signature (e.g., have both take a which_sel and have it look up the selector itself), rather than have this one-function-calls-another-function thing?

> +
> +    hvm_get_segment_register(v, x86_seg_cs, &seg);
> +    long_mode = seg.attr.fields.l;
> +
> +    if ( which_sel != x86_seg_cs )
> +        hvm_get_segment_register(v, which_sel, &seg);
> +
> +    /* "ar" is returned packed as in segment_attributes_t. Fix it up. */
> +    *ar = seg.attr.bytes;
> +    *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
> +    *ar <<= 8;
> +
> +    if ( long_mode )
> +    {
> +        *limit = ~0UL;
> +
> +        if ( which_sel < x86_seg_fs )
> +        {
> +            *base = 0UL;
> +            return 1;
> +        }
> +    }
> +    else
> +        *limit = seg.limit;
> +
> +    *base = seg.base;
> +    return 1;
> +}
> +
>  static int read_gate_descriptor(unsigned int gate_sel,
>                                  const struct vcpu *v,
>                                  unsigned int *sel,
> @@ -1590,6 +1637,13 @@ static int guest_io_okay(
>      int user_mode = !(v->arch.flags & TF_kernel_mode);
>  #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v)
>
> +    /*
> +     * For PVH we check this in vmexit for EXIT_REASON_IO_INSTRUCTION
> +     * and so don't need to check again here.
> +     */
> +    if ( is_pvh_vcpu(v) )
> +        return 1;

Same question re IO emulation.

> +
>      if ( !vm86_mode(regs) &&
>           (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ? 1 : 3)) )
>          return 1;
> @@ -1835,7 +1889,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
>      _ptr = (unsigned int)_ptr; \
>      if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) ) \
>          goto fail; \
> -    if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \
> +    if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \

Why is this changed?

>      { \
>          propagate_page_fault(_ptr + sizeof(_x) - _rc, 0); \
>          goto skip; \
> @@ -1852,6 +1906,7 @@ static int is_cpufreq_controller(struct domain *d)
>
>  static int emulate_privileged_op(struct cpu_user_regs *regs)
>  {
> +    enum x86_segment which_sel;
>      struct vcpu *v = current;
>      unsigned long *reg, eip = regs->eip;
>      u8 opcode, modrm_reg = 0, modrm_rm = 0, rep_prefix = 0, lock = 0, rex = 0;
> @@ -1874,9 +1929,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
>      void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1)));
>      uint64_t val, msr_content;
>
> -    if ( !read_descriptor(regs->cs, v, regs,
> -                          &code_base, &code_limit, &ar,
> -                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
> +    if ( !read_descriptor_sel(regs->cs, x86_seg_cs, v, regs,
> +                              &code_base, &code_limit, &ar,
> +                              _SEGMENT_CODE|_SEGMENT_S|
> +                              _SEGMENT_DPL|_SEGMENT_P) )
>          goto fail;
>      op_default = op_bytes = (ar & (_SEGMENT_L|_SEGMENT_DB)) ? 4 : 2;
>      ad_default = ad_bytes = (ar & _SEGMENT_L) ? 8 : op_default;
> @@ -1887,6 +1943,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
>
>      /* emulating only opcodes not allowing SS to be default */
>      data_sel = read_segment_register(v, regs, ds);
> +    which_sel = x86_seg_ds;
>
>      /* Legacy prefixes. */
>      for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
> @@ -1902,23 +1959,29 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
>              continue;
>          case 0x2e: /* CS override */
>              data_sel = regs->cs;
> +            which_sel = x86_seg_cs;
>              continue;
>          case 0x3e: /* DS override */
>              data_sel = read_segment_register(v, regs, ds);
> +            which_sel = x86_seg_ds;
>              continue;
>          case 0x26: /* ES override */
>              data_sel = read_segment_register(v, regs, es);
> +            which_sel = x86_seg_es;
>              continue;
>          case 0x64: /* FS override */
>              data_sel = read_segment_register(v, regs, fs);
> +            which_sel = x86_seg_fs;
>              lm_ovr = lm_seg_fs;
>              continue;
>          case 0x65: /* GS override */
>              data_sel = read_segment_register(v, regs, gs);
> +            which_sel = x86_seg_gs;
>              lm_ovr = lm_seg_gs;
>              continue;
>          case 0x36: /* SS override */
>              data_sel = regs->ss;
> +            which_sel = x86_seg_ss;

...If you did that, you wouldn't need this pair of assignments, only one of which is actually going to be used, all the way through this function.

>              continue;
>          case 0xf0: /* LOCK */
>              lock = 1;
> @@ -1962,15 +2025,16 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
>          if ( !(opcode & 2) )
>          {
>              data_sel = read_segment_register(v, regs, es);
> +            which_sel = x86_seg_es;
>              lm_ovr = lm_seg_none;
>          }
>
>          if ( !(ar & _SEGMENT_L) )
>          {
> -            if ( !read_descriptor(data_sel, v, regs,
> -                                  &data_base, &data_limit, &ar,
> -                                  _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
> -                                  _SEGMENT_P) )
> +            if ( !read_descriptor_sel(data_sel, which_sel, v, regs,
> +                                      &data_base, &data_limit, &ar,
> +                                      _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
> +                                      _SEGMENT_P) )
>                  goto fail;
>              if ( !(ar & _SEGMENT_S) ||
>                   !(ar & _SEGMENT_P) ||
> @@ -2000,9 +2064,9 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
>                  }
>              }
>              else
> -                read_descriptor(data_sel, v, regs,
> -                                &data_base, &data_limit, &ar,
> -                                0);
> +                read_descriptor_sel(data_sel, which_sel, v, regs,
> +                                    &data_base, &data_limit, &ar,
> +                                    0);
>              data_limit = ~0UL;
>              ar = _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P;
>          }
> diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
> index 072a2a7..29ed313 100644
> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -195,6 +195,8 @@ struct hvm_function_table {
>                              bool_t access_w, bool_t access_x);
>
>      int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp);
> +
> +    u16 (*read_selector)(struct vcpu *v, enum x86_segment seg);
>  };
>
>  extern struct hvm_function_table hvm_funcs;
> @@ -334,6 +336,11 @@ static inline int pvh_set_vcpu_info(struct vcpu *v,
>      return hvm_funcs.pvh_set_vcpu_info(v, ctxtp);
>  }
>
> +static inline u16 pvh_get_selector(struct vcpu *v, enum x86_segment seg)
> +{
> +    return hvm_funcs.read_selector(v, seg);
> +}
> +
>  #define is_viridian_domain(_d) \
>      (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN]))
>
> diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
> index 9bb22cb..1242657 100644
> --- a/xen/include/asm-x86/system.h
> +++ b/xen/include/asm-x86/system.h
> @@ -4,10 +4,21 @@
>  #include <xen/lib.h>
>  #include <xen/bitops.h>
>
> -#define read_segment_register(vcpu, regs, name) \
> -({  u16 __sel; \
> -    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) ); \
> -    __sel; \
> +/*
> + * We need vcpu because during context switch, going from PV to PVH,
> + * in save_segments() current has been updated to next, and no longer pointing
> + * to the PV, but the intention is to get selector for the PV. Checking
> + * is_pvh_vcpu(current) will yield incorrect results in such a case.
> + */
> +#define read_segment_register(vcpu, regs, name) \
> +({  u16 __sel; \
> +    struct cpu_user_regs *_regs = (regs); \
> + \
> +    if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) ) \
> +        __sel = pvh_get_selector(vcpu, x86_seg_##name); \
> +    else \
> +        asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) ); \

Is there a reason you discarded the STR() macro here?
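The three-line "ar" fix-up in read_descriptor_sel() quoted above is easier to follow with a worked value. The sketch below assumes the packed layout shown by the .fields initialiser in the same patch (type, s, dpl, p in bits 0..7 and avl, l, db, g in bits 8..11) and converts it to the layout of a descriptor's second dword, where the access byte sits in bits 8..15 and the flags in bits 20..23. unpack_ar() and the SEG_* constants are names made up for this example; they mirror the architectural bit positions rather than Xen's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in the second dword of a descriptor (architectural). */
#define SEG_P   (1u << 15)
#define SEG_L   (1u << 21)
#define SEG_DB  (1u << 22)
#define SEG_G   (1u << 23)

/*
 * `packed` holds type/S/DPL/P in bits 0..7 and AVL/L/DB/G in bits 8..11.
 * The result has the access byte in bits 8..15 and the flags in 20..23.
 */
static uint32_t unpack_ar(uint16_t packed)
{
    uint32_t ar = packed;

    assert((packed & 0xf000) == 0);          /* pad bits must be clear */
    ar = (ar & 0xff) | ((ar & 0xf00) << 4);  /* open the 4-bit limit gap */
    return ar << 8;                          /* move into dword position */
}
```

For a 64-bit code segment the packed value is 0xa9b (G and L set, access byte 0x9b). The first step gives 0xa09b and the shift gives 0xa09b00, so tests such as `ar & SEG_L` behave as they would on a descriptor read from the GDT.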
Jan Beulich
2013-Aug-07 14:23 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
>>> On 07.08.13 at 15:49, George Dunlap <dunlapg@umich.edu> wrote:
>> @@ -1835,7 +1889,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
>>      _ptr = (unsigned int)_ptr; \
>>      if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) ) \
>>          goto fail; \
>> -    if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \
>> +    if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \
>
> Why is this changed?

copy_from_user() accesses memory directly, i.e. is not suitable for dereferencing a pointer into the virtual memory space of a PVH/HVM guest.

Jan
George Dunlap
2013-Aug-07 14:47 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Wed, Aug 7, 2013 at 3:23 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 07.08.13 at 15:49, George Dunlap <dunlapg@umich.edu> wrote:
>>> @@ -1835,7 +1889,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
>>> _ptr = (unsigned int)_ptr; \
>>> if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) ) \
>>> goto fail; \
>>> - if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \
>>> + if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 ) \
>>
>> Why is this changed?
>
> copy_from_user() accesses memory directly, i.e. is not suitable
> for dereferencing a pointer into the virtual memory space of a
> PVH/HVM guest.

Oh right -- sorry, I noticed adding the "raw", but not the change from user -> guest. This should be mentioned in the commit message at any rate.

-George
George Dunlap
2013-Aug-07 15:37 UTC
Re: [V10 PATCH 13/23] PVH xen: interrupt/event-channel delivery to PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> index e376f3c..ce42950 100644
> --- a/xen/arch/x86/hvm/vmx/intr.c
> +++ b/xen/arch/x86/hvm/vmx/intr.c
> @@ -165,6 +165,9 @@ static int nvmx_intr_intercept(struct vcpu *v, struct hvm_intack intack)
> {
> u32 ctrl;
>
> + if ( is_pvh_vcpu(v) )
> + return 0;

Is this a "FIXME: Implement later"? Or is there something else going on?

-George
George Dunlap
2013-Aug-07 15:50 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
> @@ -443,6 +445,9 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> domcr_flags = 0;
> if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
> domcr_flags |= DOMCRF_hvm;
> + else if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
> + domcr_flags |= DOMCRF_pvh; /* PV with HAP is a PVH guest */

Um, wait a minute -- I don't think we want to exclude the possibility of *ever* having PVH with shadow pagetables, do we? Wouldn't it make more sense to just add XEN_DOMCTL_CDF_pvh_guest?

> +
> if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
> domcr_flags |= DOMCRF_hap;
> if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
> diff --git a/xen/common/kernel.c b/xen/common/kernel.c
> index 72fb905..3bba758 100644
> --- a/xen/common/kernel.c
> +++ b/xen/common/kernel.c
> @@ -289,7 +289,11 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> if ( current->domain == dom0 )
> fi.submap |= 1U << XENFEAT_dom0;
> #ifdef CONFIG_X86
> - if ( !is_hvm_vcpu(current) )
> + if ( is_pvh_vcpu(current) )
> + fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
> + (1U << XENFEAT_supervisor_mode_kernel) |
> + (1U << XENFEAT_hvm_callback_vector);
> + else if ( !is_hvm_vcpu(current) )

While you're at it, would it make sense to change this to "is_pv_vcpu()", just to make it easier to read?

-George
George Dunlap
2013-Aug-07 16:04 UTC
Re: [V10 PATCH 16/23] PVH xen: mtrr, tsc, timers, grant changes...
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> PVH only supports limited memory types in Phase I. TSC is limited to
> native mode only also for the moment. Finally, grant mapping of iomem
> for PVH hasn't been explored in phase I.
>
> Changes in V10:
> - don't migrate timers for PVH as it doesn't use rtc or emulated timers.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
> xen/arch/x86/hvm/hvm.c | 4 ++++
> xen/arch/x86/hvm/mtrr.c | 8 ++++++++
> xen/arch/x86/time.c | 8 ++++++++
> xen/common/grant_table.c | 4 ++--
> 4 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index bac4708..93aa42c 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -301,6 +301,10 @@ u64 hvm_get_guest_tsc_adjust(struct vcpu *v)
>
> void hvm_migrate_timers(struct vcpu *v)
> {
> + /* PVH doesn't use rtc and emulated timers, it uses pvclock mechanism. */
> + if ( is_pvh_vcpu(v) )
> + return;

Would it make sense to move this one in with 13/NN, "interrupt/event-channel delivery to PVH", since you're dealing with timers there as well? That would group things more by feature than by where it is in the code.

-George
George Dunlap
2013-Aug-07 16:43 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:> This patch expands HVM hcall support to include PVH. > > Changes in v8: > - Carve out PVH support of hvm_op to a small function. > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > xen/arch/x86/hvm/hvm.c | 80 +++++++++++++++++++++++++++++++++++++------ > xen/arch/x86/x86_64/traps.c | 2 +- > 2 files changed, 70 insertions(+), 12 deletions(-) > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 383c5cd..6af020e 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -3192,6 +3192,17 @@ static long hvm_vcpu_op( > case VCPUOP_register_vcpu_time_memory_area: > rc = do_vcpu_op(cmd, vcpuid, arg); > break; > + > + case VCPUOP_is_up: > + case VCPUOP_up: > + case VCPUOP_initialise: > + /* PVH fixme: this white list should be removed eventually */What do you mean by this? That PVH won''t need these in the future, or that you''ll have some other way?> + if ( is_pvh_vcpu(current) ) > + rc = do_vcpu_op(cmd, vcpuid, arg); > + else > + rc = -ENOSYS; > + break; > + > default: > rc = -ENOSYS; > break; > @@ -3312,6 +3323,24 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = { > HYPERCALL(tmem_op) > }; > > +/* PVH 32bitfixme. 
*/ > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = { > + HYPERCALL(platform_op), > + HYPERCALL(memory_op), > + HYPERCALL(xen_version), > + HYPERCALL(console_io), > + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op, > + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op, > + HYPERCALL(mmuext_op), > + HYPERCALL(xsm_op), > + HYPERCALL(sched_op), > + HYPERCALL(event_channel_op), > + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op, > + HYPERCALL(hvm_op), > + HYPERCALL(sysctl), > + HYPERCALL(domctl) > +};It would be nice if this list were in the same order as the other lists, so that it is easy to figure out what calls are common and what calls are different. -George
Mukesh Rathor
2013-Aug-07 22:37 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On Wed, 7 Aug 2013 14:10:27 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Aug 7, 2013 at 10:14 AM, George Dunlap
> <george.dunlap@eu.citrix.com> wrote:
> >> And the changelog does say it:
> >>
> >> "Note, we drop the const qualifier from vcpu_show_registers() to
> >> accommodate the hvm function call in guest_kernel_mode()."
> >
> >
> > I said *exact function*. guest_kernel_mode() doesn't need it
> > non-const; it needs it because of a function that it calls. That
> > in turn doesn't need it non-const either -- it needs it because of
> > the next one down. Who *actually* needs vcpu to be non-const, way
> > down at the bottom? That's what I need to know to understand why
> > we can't just change each of those functions to const all the way
> > down.
>
> The general principle here is that you have already done the work of
> tracing through the code to figure out what's going on; you should
> cache that information in the commit log so that reviewers (and people
> doing archaeology) don't need to duplicate the effort.

I can't remember the exact function that can't allow const, but there are tons of leaf calls being passed v. Anyways, I'll recreate the crime scene, and put it in the commit log.

Mukesh
Mukesh Rathor
2013-Aug-08 01:05 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 05 Aug 2013 12:10:15 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> > +{
> > + if ( v->vcpu_id == 0 )
> > + return 0;
> > +
> > + if ( !(ctxtp->flags & VGCF_in_kernel) )
> > + return -EINVAL;
> > +
> > + vmx_vmcs_enter(v);
> > + __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> > + __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> > + __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> > + __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
>
> Just noticed: Aren't you mixing up entries and bytes here?

Right:

    __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );

Any formatting issues here? I don't see it in the coding style, and I see code both with and without a space around '*'.

Also, when setting the limit, do we need to worry about the G flag? Or for that matter, D/B, whether the segment is growing up or down? It appears we don't need to worry about that for LDT, but I'm not sure from reading the SDMs.

thanks,
Mukesh
Mukesh Rathor
2013-Aug-08 01:27 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Wed, 7 Aug 2013 11:01:21 +0100 Tim Deegan <tim@xen.org> wrote:> At 18:53 -0700 on 06 Aug (1375815193), Mukesh Rathor wrote: > > On Mon, 5 Aug 2013 18:34:36 -0700 > > Mukesh Rathor <mukesh.rathor@oracle.com> wrote: > > > > > On Mon, 05 Aug 2013 12:08:50 +0100 > > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > > > > >>> On 26.07.13 at 02:58, Mukesh Rathor > > > > >>> <mukesh.rathor@oracle.com> wrote: > > > > > On Thu, 25 Jul 2013 14:47:55 +0100 > > > > > Tim Deegan <tim@xen.org> wrote: > > ....... > > > > > > >> At 18:59 -0700 on 23 Jul (1374605957), Mukesh Rathor wrote: > > > > > That was an option discussed with Jan, walking and reading > > > > > the GDT entries from the gdtaddr the guest provided to load > > > > > the hidden parts. But, I agree with him, that for the initial > > > > > cpu boot we can restrict the ABI to say: 0 base addr, ~0 > > > > > limit, and "read/write, accessed" default attributes for the > > > > > hidden part (64bit guest). > > > > > > > > That must be a misunderstanding then (also see my other reply) > > > > - I always meant to require that you either properly load the > > > > hidden register portions from the descriptor tables, or at > > > > least verify that the descriptor table entries referenced match > > > > the defaults you enforce. > > > > > > Ok, I thought you just wanted to be documented, If I''m gonna write > > > the code to verify, i might as well just write the hidden > > > porions, an option I''d proposed. That way there are no > > > constraints. I''m currently working on just doing that, and will > > > be in the next version of the patch. > > > > Ok, I''ve mostly got code to set the hidden fields, but the more I > > think about it, the more I feel that the right/better thing to do > > is to just not set any selectors at all, instead default them in > > the VMCS. > > Why do you think that''s better? Why on earth wouldn''t a VCPU > context-setting hypercall that takes segment selectors as arguments > just DTRT? 
The principle of least astonishment should apply here.

Ok. Looks like the common denominator is to just set the hidden fields, so I'll just do that.

thanks,
mukesh
Mukesh Rathor
2013-Aug-08 01:40 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Wed, 7 Aug 2013 11:24:42 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
.........
> > index 412971e..ece11e4 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v)
> > int i;
> > unsigned long pfn;
> >
> > + if ( is_pvh_vcpu(v) )
> > + return;
>
> There seems to be some inconsistency with where this is supposed to be
> checked -- in domain_relinquish_resources(), destroy_gdt() is only
> called for pv domains (gated on is_pv_domain); but in
> arch_set_info_guest(), it *was* gated on being PV, but with the PVH
> changes it's still being called.
>
> Either this should only be called for PV domains (and this check
> should be an ASSERT), or it should be called regardless of the type of
> domain. I prefer the first if possible.

In the original version it was being called for pv domains only, and I had checks in the caller. But, Jan preferred the check in destroy_gdt() so I moved it to destroy_gdt().

thanks
mukesh
Mukesh Rathor
2013-Aug-08 01:42 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Wed, 7 Aug 2013 11:48:26 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
..........
> > +
> > v->arch.guest_table = pagetable_from_page(cr3_page);
> > - if ( c.nat->ctrlreg[1] )
> > + if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
> > {
> > cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
> > cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
> > @@ -954,6 +965,13 @@ int arch_set_info_guest(
> >
> > update_cr3(v);
> >
> > + if ( is_pvh_vcpu(v) )
> > + {
> > + /* Set VMCS fields. */
> > + if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 )
> > + return rc;
> > + }
>
> BTW, would it make sense to pull the code above, which sets FS and GS
> for PV guests into a separate function, and then do something like the
> following?
>
> is_pv_vcpu(v) ?
> pv_set_vcpu_info() :
> pvh_set_vcpu_info();

Perhaps! But, these patches have been out for about 7 months now, and are tested by us and Andrew for PV and HVM regressions. I prefer not changing common code at this point, unless buggy. Maybe we can do an incremental change later if you really want this function to be re-org'd.

thanks
mukesh
Mukesh Rathor
2013-Aug-08 01:49 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On Wed, 7 Aug 2013 12:29:13 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor > <mukesh.rathor@oracle.com> wrote: > > This patch supports invalid op emulation for PVH by calling > > appropriate copy macros and and HVM function to inject PF. > > > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > > Reviewed-by: Jan Beulich <jbeulich@suse.com> > > --- > > xen/arch/x86/traps.c | 17 ++++++++++++++--- > > xen/include/asm-x86/traps.h | 1 + > > 2 files changed, 15 insertions(+), 3 deletions(-) > > > > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c > > index 378ef0a..a3ca70b 100644 > > --- a/xen/arch/x86/traps.c > > +++ b/xen/arch/x86/traps.c > > @@ -459,6 +459,11 @@ static void instruction_done( > > struct cpu_user_regs *regs, unsigned long eip, unsigned int > > bpmatch) { > > regs->eip = eip; > > + > > + /* PVH fixme: debug trap below */ > > + if ( is_pvh_vcpu(current) ) > > + return; > > What exactly does this comment mean? Do you mean, "FIXME: Make debug > trapping work for PVH guests"? (i.e., this functionality will be > implemented later?)Correct, future work. Look at what the db trap is doing and make it work for PVH if it doesn''t already.> > + > > regs->eflags &= ~X86_EFLAGS_RF; > > if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) ) > > { > > @@ -913,7 +918,7 @@ static int emulate_invalid_rdtscp(struct > > cpu_user_regs *regs) return EXCRET_fault_fixed; > > } > > > > -static int emulate_forced_invalid_op(struct cpu_user_regs *regs) > > +int emulate_forced_invalid_op(struct cpu_user_regs *regs) > > Why make this non-static? No one is using this in this patch. If a > later patch needs it, you should make it non-static there, so we can > decide at that point if making it non-static is merited or not.Sigh! Originally, it was that way, but then to keep that patch from getting too big, it got moved here after few versions. 
We are making emulation available for outside the PV, ie, to PVH.> > + if ( (rc = raw_copy_from_guest(sig, (char *)eip, > > sizeof(sig))) != 0 ) { > > propagate_page_fault(eip + sizeof(sig) - rc, 0); > > return EXCRET_fault_fixed; > > @@ -931,7 +936,7 @@ static int emulate_forced_invalid_op(struct > > cpu_user_regs *regs) eip += sizeof(sig); > > > > /* We only emulate CPUID. */ > > - if ( ( rc = copy_from_user(instr, (char *)eip, > > sizeof(instr))) != 0 ) > > + if ( ( rc = raw_copy_from_guest(instr, (char *)eip, > > sizeof(instr))) != 0 ) { > > propagate_page_fault(eip + sizeof(instr) - rc, 0); > > return EXCRET_fault_fixed; > > @@ -1076,6 +1081,12 @@ void propagate_page_fault(unsigned long > > addr, u16 error_code) struct vcpu *v = current; > > struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce; > > > > + if ( is_pvh_vcpu(v) ) > > + { > > + hvm_inject_page_fault(error_code, addr); > > + return; > > + } > > Would it make more sense to rename this function > "pv_inject_page_fault", and then make a macro to switch between the > two?I don''t think so, propagate_page_fault seems generic enough. -mukesh
Mukesh Rathor
2013-Aug-08 01:59 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Wed, 7 Aug 2013 14:49:50 +0100 George Dunlap <dunlapg@umich.edu> wrote:> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor > <mukesh.rathor@oracle.com> wrote:....> > > > const struct hvm_function_table * __init start_vmx(void) > > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c > > index a3ca70b..fe8b94c 100644 > > --- a/xen/arch/x86/traps.c > > +++ b/xen/arch/x86/traps.c > > @@ -480,6 +480,10 @@ static unsigned int > > check_guest_io_breakpoint(struct vcpu *v, unsigned int width, i, > > match = 0; unsigned long start; > > > > + /* PVH fixme: support io breakpoint. */ > > + if ( is_pvh_vcpu(v) ) > > + return 0; > > Does this one, and the check to IO below, have anything to do with > privileged op emulation?Yes, it''s called from emulate_privileged_op(). ...> > +static int read_descriptor_sel(unsigned int sel, > > + enum x86_segment which_sel, > > + struct vcpu *v, > > + const struct cpu_user_regs *regs, > > + unsigned long *base, > > + unsigned long *limit, > > + unsigned int *ar, > > + unsigned int vm86attr) > > +{ > > + struct segment_register seg; > > + bool_t long_mode; > > + > > + if ( !is_pvh_vcpu(v) ) > > + return read_descriptor(sel, v, regs, base, limit, ar, > > vm86attr); > > Again, wouldn''t it be better to rename read_desrciptor to > pv_read_descriptor(), name this one pvh_read_desrciptor(), give them a > similar function signature (e.g., have both take a which_sel and have > it look up the selector itself), rather than have this > one-function-calls-another-function thing?If you go back to where we discussed this in previous reviews, it is being done this way because of other callers of read_descriptor that don''t need to be changed to pass enum x86_segment.> > int user_mode = !(v->arch.flags & TF_kernel_mode); > > #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v) > > > > + /* > > + * For PVH we check this in vmexit for > > EXIT_REASON_IO_INSTRUCTION > > + * and so don''t need to check again here. 
> > + */ > > + if ( is_pvh_vcpu(v) ) > > + return 1; > > Same question re IO emulation.Same answer.> > + * We need vcpu because during context switch, going from PV to > > PVH, > > + * in save_segments() current has been updated to next, and no > > longer pointing > > + * to the PV, but the intention is to get selector for the PV. > > Checking > > + * is_pvh_vcpu(current) will yield incorrect results in such a > > case. > > + */ > > +#define read_segment_register(vcpu, regs, name) \ > > +({ u16 __sel; \ > > + struct cpu_user_regs *_regs = (regs); \ > > + \ > > + if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) ) \ > > + __sel = pvh_get_selector(vcpu, x86_seg_##name); \ > > + else \ > > + asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) ); \ > > Is there a reason you discarded the STR() macro here?Suggested by Jan to change it, not sure the reason. Jan do you recall? -Mukesh
Mukesh Rathor
2013-Aug-08 02:05 UTC
Re: [V10 PATCH 13/23] PVH xen: interrupt/event-channel delivery to PVH
On Wed, 7 Aug 2013 16:37:33 +0100 George Dunlap <dunlapg@umich.edu> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> > index e376f3c..ce42950 100644
> > --- a/xen/arch/x86/hvm/vmx/intr.c
> > +++ b/xen/arch/x86/hvm/vmx/intr.c
> > @@ -165,6 +165,9 @@ static int nvmx_intr_intercept(struct vcpu *v, struct hvm_intack intack)
> > {
> > u32 ctrl;
> >
> > + if ( is_pvh_vcpu(v) )
> > + return 0;
>
> Is this a "FIXME: Implement later"? Or is there something else going
> on?
>
> -George

At least for now, there's no nested vmx for PVH.

-M
Mukesh Rathor
2013-Aug-08 02:12 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On Wed, 7 Aug 2013 17:43:54 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor > <mukesh.rathor@oracle.com> wrote: > > This patch expands HVM hcall support to include PVH. > > > > Changes in v8: > > - Carve out PVH support of hvm_op to a small function. > > > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > > --- > > xen/arch/x86/hvm/hvm.c | 80 > > +++++++++++++++++++++++++++++++++++++------ > > xen/arch/x86/x86_64/traps.c | 2 +- 2 files changed, 70 > > insertions(+), 12 deletions(-) > > > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > > index 383c5cd..6af020e 100644 > > --- a/xen/arch/x86/hvm/hvm.c > > +++ b/xen/arch/x86/hvm/hvm.c > > @@ -3192,6 +3192,17 @@ static long hvm_vcpu_op( > > case VCPUOP_register_vcpu_time_memory_area: > > rc = do_vcpu_op(cmd, vcpuid, arg); > > break; > > + > > + case VCPUOP_is_up: > > + case VCPUOP_up: > > + case VCPUOP_initialise: > > + /* PVH fixme: this white list should be removed eventually > > */ > > What do you mean by this? That PVH won''t need these in the future, or > that you''ll have some other way?Just not have these checks here, but just support them all, whatever makese sense.> > + if ( is_pvh_vcpu(current) ) > > + rc = do_vcpu_op(cmd, vcpuid, arg); > > + else > > + rc = -ENOSYS; > > + break; > > + > > default: > > rc = -ENOSYS; > > break; > > @@ -3312,6 +3323,24 @@ static hvm_hypercall_t *const > > hvm_hypercall32_table[NR_hypercalls] = { HYPERCALL(tmem_op) > > }; > > > > +/* PVH 32bitfixme. 
*/ > > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] > > = { > > + HYPERCALL(platform_op), > > + HYPERCALL(memory_op), > > + HYPERCALL(xen_version), > > + HYPERCALL(console_io), > > + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t > > *)hvm_grant_table_op, > > + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t > > *)hvm_vcpu_op, > > + HYPERCALL(mmuext_op), > > + HYPERCALL(xsm_op), > > + HYPERCALL(sched_op), > > + HYPERCALL(event_channel_op), > > + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t > > *)hvm_physdev_op, > > + HYPERCALL(hvm_op), > > + HYPERCALL(sysctl), > > + HYPERCALL(domctl) > > +}; > > It would be nice if this list were in the same order as the other > lists, so that it is easy to figure out what calls are common and what > calls are different.These are ordered by the hcall number, and assists in the debug. -Mukesh
Jan Beulich
2013-Aug-08 06:56 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 05 Aug 2013 12:10:15 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
>> > +{
>> > + if ( v->vcpu_id == 0 )
>> > + return 0;
>> > +
>> > + if ( !(ctxtp->flags & VGCF_in_kernel) )
>> > + return -EINVAL;
>> > +
>> > + vmx_vmcs_enter(v);
>> > + __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
>> > + __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
>> > + __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
>> > + __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
>>
>> Just noticed: Aren't you mixing up entries and bytes here?
>
> Right:
>
> __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );
>
> Any formatting issues here? I don't see in coding style, and see
> both code where there is a space around '*' and not.

The inner parentheses are superfluous. CODING_STYLE is pretty explicit about there needing to be white space around operators: "Spaces are placed [...], and around binary operators (except the structure access operators, '.' and '->')."

> Also, when setting the limit, do we need to worry about the G flag?
> or for that matter, D/B whether segment is growing up or down?
> It appears we don't need to worry about that for LDT, but not sure
> reading the SDMs..

The D/B bit doesn't matter for LDT (and TSS), but the G bit would.

However - now that you're intending to require trivial state (64-bit CS, all other selectors zero), it would only be logical to also require a zero LDT selector (and hence base and entry count to be zero).

Jan
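The entries-versus-bytes point above can be sketched in isolation. This is only an illustration, not code from the patch: ldt_ents_to_limit() is a hypothetical helper, it assumes ldt_ents counts 8-byte descriptors (as the corrected __vmwrite() line does), and treating an empty LDT as limit 0 is an assumption made here for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one descriptor as counted by ldt_ents (assumed: 8 bytes). */
#define DESC_ENTRY_BYTES 8u

/*
 * Hypothetical helper, not part of the patch: convert an LDT entry
 * count into a byte-granular limit. A limit is the offset of the last
 * valid byte, hence the "- 1".
 */
uint32_t ldt_ents_to_limit(uint32_t ldt_ents)
{
    /* Keep the multiplication well inside 32 bits. */
    assert(ldt_ents <= UINT32_MAX / DESC_ENTRY_BYTES);

    /* Assumption for this sketch: an empty LDT gets limit 0. */
    if ( ldt_ents == 0 )
        return 0;

    return ldt_ents * DESC_ENTRY_BYTES - 1;
}
```

With this, one entry gives a limit of 7 and 8192 entries give 65535, which is why writing the raw entry count into a limit field under-sizes the table.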
Jan Beulich
2013-Aug-08 07:29 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
>>> On 08.08.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 7 Aug 2013 11:24:42 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> .........
>> > index 412971e..ece11e4 100644
>> > --- a/xen/arch/x86/mm.c
>> > +++ b/xen/arch/x86/mm.c
>> > @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v)
>> > int i;
>> > unsigned long pfn;
>> >
>> > + if ( is_pvh_vcpu(v) )
>> > + return;
>>
>> There seems to be some inconsistency with where this is supposed to be
>> checked -- in domain_relinquish_resources(), destroy_gdt() is only
>> called for pv domains (gated on is_pv_domain); but in
>> arch_set_info_guest(), it *was* gated on being PV, but with the PVH
>> changes it's still being called.
>>
>> Either this should only be called for PV domains (and this check
>> should be an ASSERT), or it should be called regardless of the type of
>> domain. I prefer the first if possible.
>
> In the original version it was being called for pv domains only, and I
> had checks in the caller. But, Jan preferred the check in destroy_gdt()
> so I moved it to destroy_gdt().

But that perspective may have changed with other code changes: If all callers now suppress the call for PVH guests, this should indeed be an assertion (if anything). If all but one caller checks for PV (or are in PV only code paths), the better approach now may still be to have the one odd caller do the check and have an assertion in the function. Iirc it was at least two call sites you had to adjust originally, which then warranted to do the check in just one place (in the function itself).

Jan
Jan Beulich
2013-Aug-08 07:35 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
>>> On 08.08.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 7 Aug 2013 14:49:50 +0100
> George Dunlap <dunlapg@umich.edu> wrote:
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > + * We need vcpu because during context switch, going from PV to PVH,
>> > + * in save_segments() current has been updated to next, and no longer pointing
>> > + * to the PV, but the intention is to get selector for the PV. Checking
>> > + * is_pvh_vcpu(current) will yield incorrect results in such a case.
>> > + */
>> > +#define read_segment_register(vcpu, regs, name) \
>> > +({ u16 __sel; \
>> > + struct cpu_user_regs *_regs = (regs); \
>> > + \
>> > + if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) ) \
>> > + __sel = pvh_get_selector(vcpu, x86_seg_##name); \
>> > + else \
>> > + asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) ); \
>>
>> Is there a reason you discarded the STR() macro here?
>
> Suggested by Jan to change it, not sure the reason. Jan do you recall?

I think this is the result of multiple iterations of the patch, where intermediately the stringification had disappeared altogether. When I requested it to be restored, I used the simpler # operator in the outline. In any event I think STR() should go away altogether (where necessary replaced by __stringify()), and was needlessly used in the original code here: The intended use is when you need the argument macro expanded before stringification, which is not the case here.

Jan
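The difference between the bare # operator and a two-level STR()-style macro can be shown with a small standalone sketch. The definitions below are assumptions for illustration (RAW_STR, SEG_NAME and the helper functions are made-up names, and STR is written here as the usual two-level stringify idiom rather than copied from the Xen headers).

```c
#include <string.h>

/* Hypothetical macro standing in for an argument that is itself a macro. */
#define SEG_NAME fs

#define RAW_STR(x) #x          /* stringifies the argument exactly as written */
#define STR(x)     RAW_STR(x)  /* expands the argument first, then stringifies */

/* Returns 1: for a plain token such as fs both forms give the same string. */
int literal_forms_agree(void)
{
    return strcmp(RAW_STR(fs), STR(fs)) == 0;
}

/* The one-level form does not expand its argument. */
const char *raw_of_macro(void)
{
    return RAW_STR(SEG_NAME);
}

/* The two-level form expands SEG_NAME to fs before stringifying. */
const char *expanded_of_macro(void)
{
    return STR(SEG_NAME);
}
```

Since read_segment_register() is invoked with plain register names, the two forms behave identically there, which matches the remark that the extra expansion step is not needed in this case.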
Jan Beulich
2013-Aug-08 07:41 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
>>> On 08.08.13 at 04:12, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 7 Aug 2013 17:43:54 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
>> > + HYPERCALL(platform_op),
>> > + HYPERCALL(memory_op),
>> > + HYPERCALL(xen_version),
>> > + HYPERCALL(console_io),
>> > + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
>> > + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op,
>> > + HYPERCALL(mmuext_op),
>> > + HYPERCALL(xsm_op),
>> > + HYPERCALL(sched_op),
>> > + HYPERCALL(event_channel_op),
>> > + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
>> > + HYPERCALL(hvm_op),
>> > + HYPERCALL(sysctl),
>> > + HYPERCALL(domctl)
>> > +};
>>
>> It would be nice if this list were in the same order as the other
>> lists, so that it is easy to figure out what calls are common and what
>> calls are different.
>
> These are ordered by the hcall number, and assists in the debug.

But with George asking, do you now understand a little better why on a very early revision I had asked to copy either the HVM or PV hypercall table, and override just the entries that need overrides (making it very clear which ones differ)?

Jan
Ian Campbell
2013-Aug-08 08:21 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On Wed, 2013-08-07 at 16:50 +0100, George Dunlap wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >
> > @@ -443,6 +445,9 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> > domcr_flags = 0;
> > if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
> > domcr_flags |= DOMCRF_hvm;
> > + else if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
> > + domcr_flags |= DOMCRF_pvh; /* PV with HAP is a PVH guest */
>
> Um, wait a minute -- I don't think we want to exclude the possibility
> of *ever* having PVH with shadow pagetables, do we? Wouldn't it make
> more sense to just add XEN_DOMCTL_CDF_pvh_guest?

This is a domctl so we can always change the interface if it comes to it.

Ian.
Ian Campbell
2013-Aug-08 08:26 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On Thu, 2013-08-08 at 08:41 +0100, Jan Beulich wrote:
> >>> On 08.08.13 at 04:12, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > On Wed, 7 Aug 2013 17:43:54 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >> > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
> >> > + HYPERCALL(platform_op),
> >> > + HYPERCALL(memory_op),
> >> > + HYPERCALL(xen_version),
> >> > + HYPERCALL(console_io),
> >> > + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
> >> > + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op,
> >> > + HYPERCALL(mmuext_op),
> >> > + HYPERCALL(xsm_op),
> >> > + HYPERCALL(sched_op),
> >> > + HYPERCALL(event_channel_op),
> >> > + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
> >> > + HYPERCALL(hvm_op),
> >> > + HYPERCALL(sysctl),
> >> > + HYPERCALL(domctl)
> >> > +};
> >>
> >> It would be nice if this list were in the same order as the other
> >> lists, so that it is easy to figure out what calls are common and what
> >> calls are different.
> >
> > These are ordered by the hcall number, and assists in the debug.
>
> But with George asking, do you now understand a little better
> why on a very early revision I had asked to copy either the
> HVM or PV hypercall table, and override just the entries that
> need overrides (making it very clear which ones differ)?

I was just wondering if we couldn't generate (at build time) all three tables from a more readable combined form of some sort. It's probably an awk script or something away...

Ian.
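One way to picture a "combined form" is a single list from which all three tables are expanded. The sketch below uses a C X-macro rather than the awk script mentioned above, so it is a different technique aimed at the same goal. Every name in it (HC_* indices, HYPERCALL_LIST, the handler functions and which handler each guest type gets) is made up for illustration and does not reflect the real Xen tables.

```c
#include <stddef.h>

typedef long (*hcall_fn)(void);

/* Hypothetical handlers standing in for real hypercall implementations. */
long do_memory_op(void)   { return 1; }
long hvm_memory_op(void)  { return 2; }
long do_xen_version(void) { return 3; }
long do_mmuext_op(void)   { return 4; }

/* Hypothetical hypercall numbers. */
enum { HC_memory_op, HC_xen_version, HC_mmuext_op, HC_NR };

/* One row per hypercall: name, PV handler, HVM handler, PVH handler.
 * NULL means "not available for this guest type" (illustrative only). */
#define HYPERCALL_LIST(X)                                          \
    X(memory_op,   do_memory_op,   hvm_memory_op,  do_memory_op)   \
    X(xen_version, do_xen_version, do_xen_version, do_xen_version) \
    X(mmuext_op,   do_mmuext_op,   NULL,           do_mmuext_op)

/* Each expander picks one column of the combined list. */
#define PV_ENTRY(n, pv, hvm, pvh)  [HC_##n] = pv,
#define HVM_ENTRY(n, pv, hvm, pvh) [HC_##n] = hvm,
#define PVH_ENTRY(n, pv, hvm, pvh) [HC_##n] = pvh,

const hcall_fn pv_table[HC_NR]  = { HYPERCALL_LIST(PV_ENTRY) };
const hcall_fn hvm_table[HC_NR] = { HYPERCALL_LIST(HVM_ENTRY) };
const hcall_fn pvh_table[HC_NR] = { HYPERCALL_LIST(PVH_ENTRY) };
```

Reading across a row of HYPERCALL_LIST shows at a glance where the guest types share a handler and where they differ, which is the readability concern raised in this sub-thread.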
George Dunlap
2013-Aug-08 08:55 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On 08/08/13 02:49, Mukesh Rathor wrote:
> On Wed, 7 Aug 2013 12:29:13 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>>> This patch supports invalid op emulation for PVH by calling
>>> appropriate copy macros and an HVM function to inject PF.
>>>
>>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>>  xen/arch/x86/traps.c        |   17 ++++++++++++++---
>>>  xen/include/asm-x86/traps.h |    1 +
>>>  2 files changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>>> index 378ef0a..a3ca70b 100644
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -459,6 +459,11 @@ static void instruction_done(
>>>      struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
>>>  {
>>>      regs->eip = eip;
>>> +
>>> +    /* PVH fixme: debug trap below */
>>> +    if ( is_pvh_vcpu(current) )
>>> +        return;
>> What exactly does this comment mean?  Do you mean, "FIXME: Make debug
>> trapping work for PVH guests"?  (i.e., this functionality will be
>> implemented later?)
> Correct, future work. Look at what the db trap is doing and make
> it work for PVH if it doesn't already.
>
>>> +
>>>      regs->eflags &= ~X86_EFLAGS_RF;
>>>      if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) )
>>>      {
>>> @@ -913,7 +918,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
>>>      return EXCRET_fault_fixed;
>>>  }
>>>
>>> -static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
>>> +int emulate_forced_invalid_op(struct cpu_user_regs *regs)
>> Why make this non-static?  No one is using this in this patch.  If a
>> later patch needs it, you should make it non-static there, so we can
>> decide at that point if making it non-static is merited or not.
> Sigh! Originally, it was that way, but then to keep that patch from
> getting too big, it got moved here after few versions. We are making
> emulation available for outside the PV, ie, to PVH.

As far as I'm concerned, the size of the patch itself is immaterial;
the only important question, regarding how to break down patches (just
like in breaking down functions), is how easy or difficult it is to
understand the whole thing.

Now it's typically the case that long patches are hard to understand,
and that breaking them down into smaller chunks makes them easier to
read.  But a division like this, where you've moved some random hunk
into a different patch with which it has no logical relation, makes the
series *harder* to understand, not easier.  Additionally, as the series
evolves, it makes it difficult to keep all of the dependencies
straight.  Suppose you changed your approach for that future patch so
that you didn't need this public anymore.  You, and all the reviewers,
could easily forget about the dependency, since it's in a separate
patch which may have already been classified as "OK".

It's like taking a function named foo() and breaking it down into
foo_1() and foo_2().  You're making the function shorter, but achieving
the opposite of what having short functions is supposed to achieve. :-)

>
>>> +    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
>>>      {
>>>          propagate_page_fault(eip + sizeof(sig) - rc, 0);
>>>          return EXCRET_fault_fixed;
>>> @@ -931,7 +936,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
>>>      eip += sizeof(sig);
>>>
>>>      /* We only emulate CPUID. */
>>> -    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
>>> +    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
>>>      {
>>>          propagate_page_fault(eip + sizeof(instr) - rc, 0);
>>>          return EXCRET_fault_fixed;
>>> @@ -1076,6 +1081,12 @@ void propagate_page_fault(unsigned long addr, u16 error_code)
>>>      struct vcpu *v = current;
>>>      struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce;
>>>
>>> +    if ( is_pvh_vcpu(v) )
>>> +    {
>>> +        hvm_inject_page_fault(error_code, addr);
>>> +        return;
>>> +    }
>> Would it make more sense to rename this function
>> "pv_inject_page_fault", and then make a macro to switch between the
>> two?
> I don't think so, propagate_page_fault seems generic enough.

What I meant was something similar to what I suggested for patch 10 --
make propagate_page_fault() truly generic, by making it check what mode
is running and calling either pv_inject_page_fault() or
hvm_inject_page_fault() as appropriate.

 -George
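[Editor's sketch] George's "truly generic" propagate_page_fault() could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in the series: struct vcpu, the guest-type enum, is_pv_vcpu() and both inject helpers are hypothetical stubs that only record which path was taken. In particular, the real functions operate on `current` rather than taking a vcpu argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Xen's guest types and vcpu structure. */
enum guest_type  { GUEST_PV, GUEST_PVH, GUEST_HVM };
enum inject_path { PATH_NONE, PATH_PV, PATH_HVM };

struct vcpu {
    enum guest_type  type;
    enum inject_path last_path;   /* which helper delivered the fault */
    unsigned long    fault_addr;
    uint16_t         error_code;
};

static int is_pv_vcpu(const struct vcpu *v)
{
    return v->type == GUEST_PV;
}

/* Stub for the PV path (the trap-bounce code in the real function). */
static void pv_inject_page_fault(struct vcpu *v, unsigned long addr,
                                 uint16_t error_code)
{
    v->last_path  = PATH_PV;
    v->fault_addr = addr;
    v->error_code = error_code;
}

/* Stub for the HVM container path, shared by PVH and HVM guests. */
static void hvm_inject_page_fault(struct vcpu *v, uint16_t error_code,
                                  unsigned long addr)
{
    v->last_path  = PATH_HVM;
    v->fault_addr = addr;
    v->error_code = error_code;
}

/* Generic entry point: callers no longer care about the guest type. */
static void propagate_page_fault(struct vcpu *v, unsigned long addr,
                                 uint16_t error_code)
{
    assert(v != NULL);

    if ( is_pv_vcpu(v) )
        pv_inject_page_fault(v, addr, error_code);
    else
        hvm_inject_page_fault(v, error_code, addr);
}
```

The point of the split is that the PV-only trap-bounce logic gets its own name, so the generic function reads as a dispatcher rather than as PV code with an early return bolted on.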
George Dunlap
2013-Aug-08 08:57 UTC
Re: [V10 PATCH 08/23] PVH xen: Introduce PVH guest type and some basic changes.
On 07/08/13 23:37, Mukesh Rathor wrote:
> On Wed, 7 Aug 2013 14:10:27 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Aug 7, 2013 at 10:14 AM, George Dunlap
>> <george.dunlap@eu.citrix.com> wrote:
>>>> And the changelog does say it:
>>>>
>>>> "Note, we drop the const qualifier from vcpu_show_registers() to
>>>> accommodate the hvm function call in guest_kernel_mode()."
>>>
>>> I said *exact function*.  guest_kernel_mode() doesn't need it
>>> non-const; it needs it because of a function that it calls.  That
>>> in turn doesn't need it non-const either -- it needs it because of
>>> the next one down.  Who *actually* needs vcpu to be non-const, way
>>> down at the bottom?  That's what I need to know to understand why
>>> we can't just change each of those functions to const all the way
>>> down.
>> The general principle here is that you have already done the work of
>> tracing through the code to figure out what's going on; you should
>> cache that information in the commit log so that reviewers (and people
>> doing archaeology) don't need to duplicate the effort.
> I can't remember exact function that can't allow const, but there
> are tons of leaf calls being passed v. Anyways, I'll recreate the
> crime scene, and put in the comment log.

I appreciate it, thanks.

 -George
George Dunlap
2013-Aug-08 09:02 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On 08/08/13 08:29, Jan Beulich wrote:
>>>> On 08.08.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> On Wed, 7 Aug 2013 11:24:42 +0100
>> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> .........
>>>> index 412971e..ece11e4 100644
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v)
>>>>      int i;
>>>>      unsigned long pfn;
>>>>
>>>> +    if ( is_pvh_vcpu(v) )
>>>> +        return;
>>> There seems to be some inconsistency with where this is supposed to be
>>> checked -- in domain_relinquish_resources(), destroy_gdt() is only
>>> called for pv domains (gated on is_pv_domain); but in
>>> arch_set_info_guest(), it *was* gated on being PV, but with the PVH
>>> changes it's still being called.
>>>
>>> Either this should only be called for PV domains (and this check
>>> should be an ASSERT), or it should be called regardless of the type of
>>> domain.  I prefer the first if possible.
>> In the original version it was being called for pv domains only, and I
>> had checks in the caller. But, Jan preferred the check in destroy_gdt()
>> so I moved it to destroy_gdt().
> But that perspective may have changed with other code changes:
> If all callers now suppress the call for PVH guests, this should indeed
> be an assertion (if anything). If all but one caller checks for PV (or
> are in PV only code paths), the better approach now may still be to
> have the one odd caller do the check and have an assertion in the
> function. Iirc it was at least two call sites you had to adjust originally,
> which then warranted to do the check in just one place (in the
> function itself).

Overall I think checking before calling would make the code easier to
understand.  All the functions which call this have is_foo_domain()
sprinkled all over anyway, and so it's easier for someone reading the
code to understand immediately that HVM and PVH guests don't need their
gdt destroyed.  But my main point was that if we check inside the
function, we should avoid checking outside the function for consistency.

 -George
Jan Beulich
2013-Aug-08 09:08 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
>>> On 08.08.13 at 11:02, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Overall I think checking before calling would make the code easier to
> understand.  All the functions which call this have is_foo_domain()
> sprinkled all over anyway, and so it's easier for someone reading the
> code to understand immediately that HVM and PVH guests don't need their
> gdt destroyed.  But my main point was that if we check inside the
> function, we should avoid checking outside the function for consistency.

And I fully agree with that (here as well as in the various cases
where such redundancy already exists in the code).

Jan
George Dunlap
2013-Aug-08 09:14 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On 08/08/13 02:42, Mukesh Rathor wrote:
> On Wed, 7 Aug 2013 11:48:26 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
> ..........
>>> +
>>>      v->arch.guest_table = pagetable_from_page(cr3_page);
>>> -    if ( c.nat->ctrlreg[1] )
>>> +    if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
>>>      {
>>>          cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
>>>          cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
>>> @@ -954,6 +965,13 @@ int arch_set_info_guest(
>>>
>>>      update_cr3(v);
>>>
>>> +    if ( is_pvh_vcpu(v) )
>>> +    {
>>> +        /* Set VMCS fields. */
>>> +        if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 )
>>> +            return rc;
>>> +    }
>> BTW, would it make sense to pull the code above, which sets FS and GS
>> for PV guests into a separate function, and then do something like the
>> following?
>>
>> is_pv_vcpu(v) ?
>>      pv_set_vcpu_info() :
>>      pvh_set_vcpu_info();
> Perhaps! But, these patches have been out for about 7 months now, and
> are tested by us and Andrew for PV and HVM regressions. I prefer not
> changing common code at this point, unless buggy. Maybe we can do an
> incremental change later if you really want this function to be
> re-org'd.

I'm keen to get the patches in, but I'm also keen on making the changes
as clean as possible.  It's a lot harder to go through and clean things
up later.  I don't think this level of re-arranging should be too much
additional delay, or have very much risk of regression.

Obviously the other maintainers may feel differently.  I'm sorry I'm a
bit late to the party and only now able to find time to give my input.
Once I have a view of the whole series, I can come back and give maybe
a "Not-perfectly-happy-but-wont-NACK" or something like that, and others
can decide whether to hold out for the change or just get it done with.
:-)

BTW the testing thing I don't think is a valid argument; we're early in
the dev cycle, so a regression isn't a big deal; and in any case the
risk of regression is the same whether the change happens before it's
committed or afterwards.

 -George
George Dunlap
2013-Aug-08 09:20 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On 08/08/13 03:12, Mukesh Rathor wrote:
> On Wed, 7 Aug 2013 17:43:54 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>>> This patch expands HVM hcall support to include PVH.
>>>
>>> Changes in v8:
>>>    - Carve out PVH support of hvm_op to a small function.
>>>
>>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>>> ---
>>>  xen/arch/x86/hvm/hvm.c      |   80 +++++++++++++++++++++++++++++++++++++------
>>>  xen/arch/x86/x86_64/traps.c |    2 +-
>>>  2 files changed, 70 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>>> index 383c5cd..6af020e 100644
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -3192,6 +3192,17 @@ static long hvm_vcpu_op(
>>>      case VCPUOP_register_vcpu_time_memory_area:
>>>          rc = do_vcpu_op(cmd, vcpuid, arg);
>>>          break;
>>> +
>>> +    case VCPUOP_is_up:
>>> +    case VCPUOP_up:
>>> +    case VCPUOP_initialise:
>>> +        /* PVH fixme: this white list should be removed eventually */
>> What do you mean by this?  That PVH won't need these in the future, or
>> that you'll have some other way?
> Just not have these checks here, but just support them all, whatever
> makes sense.

Sorry, I still don't understand -- do you mean you want to eventually
just allow all VCPUOPs for PVH?

>
>>> +        if ( is_pvh_vcpu(current) )
>>> +            rc = do_vcpu_op(cmd, vcpuid, arg);
>>> +        else
>>> +            rc = -ENOSYS;
>>> +        break;
>>> +
>>>      default:
>>>          rc = -ENOSYS;
>>>          break;
>>> @@ -3312,6 +3323,24 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = {
>>>      HYPERCALL(tmem_op)
>>>  };
>>>
>>> +/* PVH 32bitfixme. */
>>> +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
>>> +    HYPERCALL(platform_op),
>>> +    HYPERCALL(memory_op),
>>> +    HYPERCALL(xen_version),
>>> +    HYPERCALL(console_io),
>>> +    [ __HYPERVISOR_grant_table_op ]  = (hvm_hypercall_t *)hvm_grant_table_op,
>>> +    [ __HYPERVISOR_vcpu_op ]         = (hvm_hypercall_t *)hvm_vcpu_op,
>>> +    HYPERCALL(mmuext_op),
>>> +    HYPERCALL(xsm_op),
>>> +    HYPERCALL(sched_op),
>>> +    HYPERCALL(event_channel_op),
>>> +    [ __HYPERVISOR_physdev_op ]      = (hvm_hypercall_t *)hvm_physdev_op,
>>> +    HYPERCALL(hvm_op),
>>> +    HYPERCALL(sysctl),
>>> +    HYPERCALL(domctl)
>>> +};
>> It would be nice if this list were in the same order as the other
>> lists, so that it is easy to figure out what calls are common and what
>> calls are different.
> These are ordered by the hcall number, and assists in the debug.

That makes sense.  What about adding a "prep" patch which re-organizes
the other lists by hcall number?  I'm not particular about which order,
I just think they should be the same.

 -George
Jan Beulich
2013-Aug-08 10:18 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
>>> On 08.08.13 at 11:20, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 08/08/13 03:12, Mukesh Rathor wrote:
>> On Wed, 7 Aug 2013 17:43:54 +0100
>> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>
>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>>> <mukesh.rathor@oracle.com> wrote:
>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>> @@ -3192,6 +3192,17 @@ static long hvm_vcpu_op(
>>>>      case VCPUOP_register_vcpu_time_memory_area:
>>>>          rc = do_vcpu_op(cmd, vcpuid, arg);
>>>>          break;
>>>> +
>>>> +    case VCPUOP_is_up:
>>>> +    case VCPUOP_up:
>>>> +    case VCPUOP_initialise:
>>>> +        /* PVH fixme: this white list should be removed eventually */
>>> What do you mean by this?  That PVH won't need these in the future, or
>>> that you'll have some other way?
>> Just not have these checks here, but just support them all, whatever
>> makes sense.
>
> Sorry, I still don't understand -- do you mean you want to eventually
> just allow all VCPUOPs for PVH?

That must be the ultimate goal (i.e. as little special casing
between PV and PVH as possible).

Jan
George Dunlap
2013-Aug-08 14:18 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Thu, Aug 8, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 7 Aug 2013 14:49:50 +0100
> George Dunlap <dunlapg@umich.edu> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
> ....
>> >
>> >  const struct hvm_function_table * __init start_vmx(void)
>> > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>> > index a3ca70b..fe8b94c 100644
>> > --- a/xen/arch/x86/traps.c
>> > +++ b/xen/arch/x86/traps.c
>> > @@ -480,6 +480,10 @@ static unsigned int check_guest_io_breakpoint(struct vcpu *v,
>> >      unsigned int width, i, match = 0;
>> >      unsigned long start;
>> >
>> > +    /* PVH fixme: support io breakpoint. */
>> > +    if ( is_pvh_vcpu(v) )
>> > +        return 0;
>>
>> Does this one, and the check to IO below, have anything to do with
>> privileged op emulation?
>
> Yes, it's called from emulate_privileged_op().
>
> ...
>
>> > +static int read_descriptor_sel(unsigned int sel,
>> > +                               enum x86_segment which_sel,
>> > +                               struct vcpu *v,
>> > +                               const struct cpu_user_regs *regs,
>> > +                               unsigned long *base,
>> > +                               unsigned long *limit,
>> > +                               unsigned int *ar,
>> > +                               unsigned int vm86attr)
>> > +{
>> > +    struct segment_register seg;
>> > +    bool_t long_mode;
>> > +
>> > +    if ( !is_pvh_vcpu(v) )
>> > +        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
>>
>> Again, wouldn't it be better to rename read_descriptor to
>> pv_read_descriptor(), name this one pvh_read_descriptor(), give them a
>> similar function signature (e.g., have both take a which_sel and have
>> it look up the selector itself), rather than have this
>> one-function-calls-another-function thing?
>
> If you go back to where we discussed this in previous reviews, it
> is being done this way because of other callers of read_descriptor
> that don't need to be changed to pass enum x86_segment.

OK, first, like I said, I'm sorry I didn't have a chance to look at
this before, and in general it's totally fair for you to say "we
talked about this already".

But in this particular case, I have to complain.  I just spent 45
minutes going back and finding where it was discussed in previous
reviews, and there turns out to have been NO DISCUSSION.  You just
said, "I did it for this reason" (which is the same as what you said
above), and Jan said, "OK".  That was a complete waste of my time.

In the future, only send me back to look at previous discussions if 1)
there's actually something there that's worth reading, and 2) you
can't summarize it here.

OK, so read_descriptor() has other callers.  I still think, though,
that if you're doing a wrapper you should do it properly.

Before this patch, callers of read_descriptor look up the selector
themselves (normally by directly reading regs->$SEGMENT_REGISTER).
You can't do this for PVH, because you need to have VMX code read the
segment register to find which descriptor you want to read.  So you
have a "wrapper" function, read_descriptor_sel, which takes the segment
register, rather than the contents of the segment register.  All well
and good so far.

The problem I have is that you still pass in *both* the value of
regs->$SEGMENT_REGISTER, *and* an enum of a segment register, and use
one in one case, and another in a different case.  That's just a
really ugly interface.

What I'd like to see is for read_descriptor_sel() to *just* take
which_sel (perhaps renamed sreg or something, since it's referring to a
segment register), and in the PV case, read the appropriate segment
register, then calling read_descriptor().  Then you don't have this
crazy thing where you set two variables (sel and which_sel) all over
the place.

 -George
George Dunlap
2013-Aug-08 14:21 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Thu, Aug 8, 2013 at 8:35 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 08.08.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> On Wed, 7 Aug 2013 14:49:50 +0100
>> George Dunlap <dunlapg@umich.edu> wrote:
>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>>> > + * We need vcpu because during context switch, going from PV to PVH,
>>> > + * in save_segments() current has been updated to next, and no longer pointing
>>> > + * to the PV, but the intention is to get selector for the PV. Checking
>>> > + * is_pvh_vcpu(current) will yield incorrect results in such a case.
>>> > + */
>>> > +#define read_segment_register(vcpu, regs, name)                   \
>>> > +({  u16 __sel;                                                    \
>>> > +    struct cpu_user_regs *_regs = (regs);                         \
>>> > +                                                                  \
>>> > +    if ( is_pvh_vcpu(vcpu) && guest_mode(_regs) )                 \
>>> > +        __sel = pvh_get_selector(vcpu, x86_seg_##name);           \
>>> > +    else                                                          \
>>> > +        asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) );    \
>>>
>>> Is there a reason you discarded the STR() macro here?
>>
>> Suggested by Jan to change it, not sure the reason. Jan do you recall?
>
> I think this is the result of multiple iterations of the patch, where
> intermediately the stringification had disappeared altogether.
> When I requested it to be restored, I used the simpler # operator
> in the outline.
>
> In any event I think STR() should go away altogether (where
> necessary replaced by __stringify()), and was needlessly used
> in the original code here: The intended use is when you need
> the argument macro expanded before stringification, which is
> not the case here.

OK -- I don't have strong opinions on STR, I just wanted to make sure
it wasn't an oversight.  (Might be worth mentioning this in the change
log.)

 -George
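[Editor's sketch] Jan's distinction between the bare # operator and a two-level stringify macro can be seen in a few lines of standard C. The macro names below are hypothetical; they are not Xen's STR() or __stringify(), though the two-level form is the usual way such a macro is written. SEG_MOV_TEMPLATE only builds a string shaped like the asm template in the quoted patch and executes no assembly.

```c
#include <assert.h>
#include <string.h>

/* One level: the argument is turned into a string as written. */
#define STRINGIFY_RAW(x)      #x
/* Two levels: the argument is macro-expanded first, then stringified. */
#define STRINGIFY_EXPANDED(x) STRINGIFY_RAW(x)

/* A macro whose expansion we may or may not want to see in the string. */
#define SEG fs

static const char *const raw_str      = STRINGIFY_RAW(SEG);
static const char *const expanded_str = STRINGIFY_EXPANDED(SEG);

/*
 * In read_segment_register() the register name is passed as a literal
 * token (cs, ds, ...), so nothing needs expanding and # is sufficient.
 */
#define SEG_MOV_TEMPLATE(name) "movw %%" #name ",%0"

static const char *const ds_template = SEG_MOV_TEMPLATE(ds);

/* Returns 1 if the three strings have the values described above. */
static int stringify_demo_ok(void)
{
    return strcmp(raw_str, "SEG") == 0 &&
           strcmp(expanded_str, "fs") == 0 &&
           strcmp(ds_template, "movw %%ds,%0") == 0;
}
```

The extra level of indirection only matters when the argument is itself a macro, which is why Jan considers STR() unnecessary at this call site.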
George Dunlap
2013-Aug-08 14:36 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Thu, Aug 8, 2013 at 3:18 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> Again, wouldn't it be better to rename read_descriptor to
>>> pv_read_descriptor(), name this one pvh_read_descriptor(), give them a
>>> similar function signature (e.g., have both take a which_sel and have
>>> it look up the selector itself), rather than have this
>>> one-function-calls-another-function thing?
>>
>> If you go back to where we discussed this in previous reviews, it
>> is being done this way because of other callers of read_descriptor
>> that don't need to be changed to pass enum x86_segment.
>
> OK, first, like I said, I'm sorry I didn't have a chance to look at
> this before, and in general it's totally fair for you to say "we
> talked about this already".

Although, on second thought, it may not be fair after all: all the
necessary information to understand why the change was made should be
in the commit message, not just for stragglers like me, but for
archaeologists going back to figure out why the code is the way it is.
It's not fair to expect them to go back and read the entire 7-month
saga to figure out what's going on.

That said, I can understand why you may be frustrated with someone
coming in at the 11th hour and having a bunch of criticisms and
questions.  I am sorry that's the way it turned out.  Nonetheless, my
comments stand.

 -George
Mukesh Rathor
2013-Aug-08 22:07 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Thu, 08 Aug 2013 07:56:41 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > On Mon, 05 Aug 2013 12:10:15 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> >> > +{
> >> > +    if ( v->vcpu_id == 0 )
> >> > +        return 0;
> >> > +
> >> > +    if ( !(ctxtp->flags & VGCF_in_kernel) )
> >> > +        return -EINVAL;
> >> > +
> >> > +    vmx_vmcs_enter(v);
> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> >> > +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> >> > +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
> >>
> >> Just noticed: Aren't you mixing up entries and bytes here?
> >
> > Right:
> >
> >     __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );
> >
> > Any formatting issues here? I don't see in coding style, and see
> > both code where there is a space around '*' and not.
>
> The inner parentheses are superfluous.
>
> CODING_STYLE is pretty explicit about there needing to be white
> space around operators: "Spaces are placed [...], and around
> binary operators (except the structure access operators, '.' and
> '->')."
>
> > Also, when setting the limit, do we need to worry about the G flag?
> > or for that matter, D/B whether segment is growing up or down?
> > It appears we don't need to worry about that for LDT, but not sure
> > reading the SDMs..
>
> The D/B bit doesn't matter for LDT (and TSS), but the G bit would.
> However - now that you're intending to require trivial state (64-bit
> CS, all other selectors zero), it would only be logical to also
> require a zero LDT selector (and hence base and entry count to be
> zero).

No, Tim would like to see the hcall set the hidden fields. Since you
are also OK with that, I'm just doing that now. So, we just build
default VMCS like we do now, and then this hcall will let them set
the selectors, and all hidden fields (implicitly).

thanks
mukesh
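[Editor's sketch] The entries-versus-bytes fix quoted above is plain arithmetic: a byte-granular limit is the offset of the last valid byte, so N eight-byte descriptors give a limit of N * 8 - 1. The helper below is a hypothetical illustration, not code from the series. Returning 0 for an empty table is an assumption made here to avoid unsigned wrap-around; the quoted one-liner does not handle that case.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_BYTES      8u      /* bytes per descriptor entry */
#define MAX_LDT_ENTRIES 8192u   /* a selector carries a 13-bit index */

/*
 * Convert an entry count into a byte-granular segment limit.
 * Hypothetical helper: ents == 0 is mapped to 0 (an assumption), since
 * 0 * 8 - 1 would wrap around in unsigned arithmetic.
 */
static uint32_t ldt_limit_from_entries(uint32_t ents)
{
    assert(ents <= MAX_LDT_ENTRIES);

    if ( ents == 0 )
        return 0;

    return ents * DESC_BYTES - 1;
}
```

With these definitions, one entry gives a limit of 7 and the maximum of 8192 entries gives 65535, which still fits a byte-granular limit, consistent with the G bit not needing to be set for that range.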
Mukesh Rathor
2013-Aug-09 00:12 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Thu, 08 Aug 2013 08:29:27 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 08.08.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > On Wed, 7 Aug 2013 11:24:42 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > .........
> >> > index 412971e..ece11e4 100644
> >> > --- a/xen/arch/x86/mm.c
> >> > +++ b/xen/arch/x86/mm.c
> >> > @@ -4334,6 +4334,9 @@ void destroy_gdt(struct vcpu *v)
> >> >      int i;
> >> >      unsigned long pfn;
> >> >
> >> > +    if ( is_pvh_vcpu(v) )
> >> > +        return;
> >>
> >> There seems to be some inconsistency with where this is supposed
> >> to be checked -- in domain_relinquish_resources(), destroy_gdt()
> >> is only called for pv domains (gated on is_pv_domain); but in
> >> arch_set_info_guest(), it *was* gated on being PV, but with the PVH
> >> changes it's still being called.
> >>
> >> Either this should only be called for PV domains (and this check
> >> should be an ASSERT), or it should be called regardless of the
> >> type of domain.  I prefer the first if possible.
> >
> > In the original version it was being called for pv domains only,
> > and I had checks in the caller. But, Jan preferred the check in
> > destroy_gdt() so I moved it to destroy_gdt().
>
> But that perspective may have changed with other code changes:
> If all callers now suppress the call for PVH guests, this should
> indeed be an assertion (if anything). If all but one caller checks
> for PV (or are in PV only code paths), the better approach now may
> still be to have the one odd caller do the check and have an
> assertion in the function. Iirc it was at least two call sites you
> had to adjust originally, which then warranted to do the check in
> just one place (in the function itself).

The only change I see relevant to this between my patch and now is
fewer callers in arch_set_info_guest(). The change is small enough, and
I'll make it so that destroy_gdt() gets called for PV only, ie, the
check is in the caller.

thanks
mukesh
Mukesh Rathor
2013-Aug-09 00:17 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On Thu, 8 Aug 2013 09:55:26 +0100
George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 08/08/13 02:49, Mukesh Rathor wrote:
> > On Wed, 7 Aug 2013 12:29:13 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
.........
> >>> +    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
> >>>      {
> >>>          propagate_page_fault(eip + sizeof(sig) - rc, 0);
> >>>          return EXCRET_fault_fixed;
> >>> @@ -931,7 +936,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
> >>>      eip += sizeof(sig);
> >>>
> >>>      /* We only emulate CPUID. */
> >>> -    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
> >>> +    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
> >>>      {
> >>>          propagate_page_fault(eip + sizeof(instr) - rc, 0);
> >>>          return EXCRET_fault_fixed;
> >>> @@ -1076,6 +1081,12 @@ void propagate_page_fault(unsigned long addr, u16 error_code)
> >>>      struct vcpu *v = current;
> >>>      struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce;
> >>>
> >>> +    if ( is_pvh_vcpu(v) )
> >>> +    {
> >>> +        hvm_inject_page_fault(error_code, addr);
> >>> +        return;
> >>> +    }
> >> Would it make more sense to rename this function
> >> "pv_inject_page_fault", and then make a macro to switch between the
> >> two?
> > I don't think so, propagate_page_fault seems generic enough.
>
> What I meant was something similar to what I suggested for patch 10
> -- make propagate_page_fault() truly generic, by making it check what
> mode is running and calling either pv_inject_page_fault() or
> hvm_inject_page_fault() as appropriate.

I guess, what you mean:

propagate_page_fault():
{
    if (pvh)
        hvm_inject_pf
    else if pv
        pv_inject_pf
}

where pv_inject_pf() is all the code after my "if pvh" in the current
patch. Small enough change I can accommodate in the next patch.

Mukesh
Mukesh Rathor
2013-Aug-09 00:55 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On Thu, 08 Aug 2013 08:41:04 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 08.08.13 at 04:12, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > On Wed, 7 Aug 2013 17:43:54 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
> >> > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
> >> > +    HYPERCALL(platform_op),
> >> > +    HYPERCALL(memory_op),
> >> > +    HYPERCALL(xen_version),
> >> > +    HYPERCALL(console_io),
> >> > +    [ __HYPERVISOR_grant_table_op ]  = (hvm_hypercall_t *)hvm_grant_table_op,
> >> > +    [ __HYPERVISOR_vcpu_op ]         = (hvm_hypercall_t *)hvm_vcpu_op,
> >> > +    HYPERCALL(mmuext_op),
> >> > +    HYPERCALL(xsm_op),
> >> > +    HYPERCALL(sched_op),
> >> > +    HYPERCALL(event_channel_op),
> >> > +    [ __HYPERVISOR_physdev_op ]      = (hvm_hypercall_t *)hvm_physdev_op,
> >> > +    HYPERCALL(hvm_op),
> >> > +    HYPERCALL(sysctl),
> >> > +    HYPERCALL(domctl)
> >> > +};
> >>
> >> It would be nice if this list were in the same order as the other
> >> lists, so that it is easy to figure out what calls are common and
> >> what calls are different.
> >
> > These are ordered by the hcall number, and assists in the debug.
>
> But with George asking, do you now understand a little better
> why on a very early revision I had asked to copy either the
> HVM or PV hypercall table, and override just the entries that
> need overrides (making it very clear which ones differ)?

Like I've said before, I believe that is a poor and obfuscating way of
doing it, and I don't want my name on something I completely disagree
with. It makes code harder to read IMO. I'm adding such a small
extension to the existing HVM code that I believe it's hardly reaching
a tipping point.

PVH is still evolving; this is the first patch, again, minimal changes
to make a guest boot and come up in PVH mode. Over time we'll come to
understand more what other hcalls need to be added and to what extent.
At that point further enhancements can be made...

Mukesh
Mukesh Rathor
2013-Aug-09 01:32 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On Thu, 8 Aug 2013 15:18:56 +0100
George Dunlap <George.Dunlap@eu.citrix.com> wrote:

> On Thu, Aug 8, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > On Wed, 7 Aug 2013 14:49:50 +0100
> > George Dunlap <dunlapg@umich.edu> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
.......
> The problem I have is that you still pass in *both* the value of
> regs->$SEGMENT_REGISTER, *and* an enum of a segment register, and use
> one in one case, and another in a different case.  That's just a
> really ugly interface.
>
> What I'd like to see is for read_descriptor_sel() to *just* take
> which_sel (perhaps renamed sreg or something, since it's referring to
> a segment register), and in the PV case, read the appropriate segment
> register, then calling read_descriptor().  Then you don't have this
> crazy thing where you set two variables (sel and which_sel) all over
> the place.

Hmm... lemme make sure I understand precisely, what you mean is
something like:

static int read_descriptor_sel(enum x86_segment which_sel,
                               struct vcpu *v,
                               const struct cpu_user_regs *regs,
                               unsigned long *base,
                               unsigned long *limit,
                               unsigned int *ar,
                               unsigned int vm86attr)
{
    uint sel;

    if (!pvh)
    {
        sel = read_pv_segreg(which_sel);
        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
    }
}

where read_pv_segreg() has one long case statement:

    case x86_seg_cs:
        return read_segment_register(v, regs, cs);
    case x86_seg_ds:
        return read_segment_register(v, regs, ds);
    .....

Then emulate_privileged_op() will not be setting data_sel, but
only which_sel, except for one place:

    ....
    if ( lm_ovr == lm_seg_none || data_sel < 4 )
    {
        switch ( lm_ovr )
        {
        case lm_seg_none:
    ...

That sounds like a good change to me. Jan, you OK with this?

-mukesh
Jan Beulich
2013-Aug-09 06:54 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
>>> On 09.08.13 at 03:32, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 8 Aug 2013 15:18:56 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> The problem I have is that you still pass in *both* the value of
>> regs->$SEGMENT_REGISTER, *and* an enum of a segment register, and use
>> one in one case, and another in a different case. That's just a
>> really ugly interface.
>>
>> What I'd like to see is for read_descriptor_sel() to *just* take
>> which_sel (perhaps renamed sreg or something, since it's referring to
>> a segment register), and in the PV case, read the appropriate segment
>> register, then calling read_descriptor(). Then you don't have this
>> crazy thing where you set two variables (sel and which_cs) all over
>> the place.
>
> Hmm... lemme make sure I understand precisely, what you mean is
> something like:
>
> static int read_descriptor_sel(enum x86_segment which_sel,
>                                struct vcpu *v,
>                                const struct cpu_user_regs *regs,
>                                unsigned long *base,
>                                unsigned long *limit,
>                                unsigned int *ar,
>                                unsigned int vm86attr)
> {
>     uint sel;
>
>     if (!pvh)
>     {
>         sel = read_pv_segreg(which_sel);
>         return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
>     }
> }
>
> where read_pv_segreg() has one long case statement:
>
>     case x86_seg_cs:
>         return read_segment_register(v, regs, cs);
>     case x86_seg_ds:
>         return read_segment_register(v, regs, ds);
>     .....
>
> Then emulate_privileged_op() will not be setting data_sel, but
> only which_sel, except for one place:
>
>     ....
>     if ( lm_ovr == lm_seg_none || data_sel < 4 )
>     {
>         switch ( lm_ovr )
>         {
>         case lm_seg_none:
>     ...
>
> That sounds like a good change to me. Jan, you OK with this?

It's worse performance-wise, but better maintenance-wise, so I guess I don't really object (but also am not too happy with it).

And of course your use of read_segment_register(v, regs, cs) above is anything but correct - CS and SS need to be read from regs.

Jan
Jan Beulich
2013-Aug-09 06:56 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
>>> On 09.08.13 at 02:55, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 08 Aug 2013 08:41:04 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 08.08.13 at 04:12, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Wed, 7 Aug 2013 17:43:54 +0100
>> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> >> <mukesh.rathor@oracle.com> wrote:
>> >> > +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
>> >> > +    HYPERCALL(platform_op),
>> >> > +    HYPERCALL(memory_op),
>> >> > +    HYPERCALL(xen_version),
>> >> > +    HYPERCALL(console_io),
>> >> > +    [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
>> >> > +    [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op,
>> >> > +    HYPERCALL(mmuext_op),
>> >> > +    HYPERCALL(xsm_op),
>> >> > +    HYPERCALL(sched_op),
>> >> > +    HYPERCALL(event_channel_op),
>> >> > +    [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
>> >> > +    HYPERCALL(hvm_op),
>> >> > +    HYPERCALL(sysctl),
>> >> > +    HYPERCALL(domctl)
>> >> > +};
>> >>
>> >> It would be nice if this list were in the same order as the other
>> >> lists, so that it is easy to figure out what calls are common and
>> >> what calls are different.
>> >
>> > These are ordered by the hcall number, and assists in the debug.
>>
>> But with George asking, do you now understand a little better
>> why on a very early revision I had asked to copy either the
>> HVM or PV hypercall table, and override just the entries that
>> need overrides (making it very clear which ones differ)?
>
> Like I've said before, I believe that is a poor and obfuscating way of doing
> it, and I don't want my name on something I completely disagree with. It
> makes code harder to read IMO. I'm adding such a
> small extension to the existing HVM code, that I believe it's hardly
> reaching a tipping point. PVH is still evolving, this is first patch,
> again, minimal changes to make a guest boot and come up in PVH mode.
> Over time we'll come to understand more what other hcalls need to be
> added and to what extent. At that point further enhancements can be
> made...

And I didn't mean to ask that you change your patch in this regard, I just wanted to point out that I'm not the only one thinking differently than you.

In any event, I guess once your code is in I'll try to remember to follow IanC's suggestion and script the hypercall table generation.

Jan
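The "copy a table and override only the entries that differ" approach Jan describes can be sketched in plain C with designated initializers, where a later initializer for the same index replaces an earlier one. This is only an illustration under stated assumptions: the hypercall numbers, the handler names, the BASE_HYPERCALLS macro, and the dispatch() helper are all hypothetical and far smaller than Xen's real tables.

```c
#include <assert.h>
#include <stddef.h>

typedef long hypercall_fn_t(void);

/* Hypothetical hypercall numbers; not Xen's real numbering. */
enum { HC_memory_op, HC_xen_version, HC_grant_table_op, HC_vcpu_op, NR_HC };

/* Stand-in handlers; return values only identify which one ran. */
static long do_memory_op(void)       { return 1; }
static long do_xen_version(void)     { return 2; }
static long do_grant_table_op(void)  { return 3; }
static long do_vcpu_op(void)         { return 4; }
static long hvm_grant_table_op(void) { return 30; }
static long hvm_vcpu_op(void)        { return 40; }

/* Entries of the table being copied. */
#define BASE_HYPERCALLS                      \
    [HC_memory_op]      = do_memory_op,      \
    [HC_xen_version]    = do_xen_version,    \
    [HC_grant_table_op] = do_grant_table_op, \
    [HC_vcpu_op]        = do_vcpu_op

static hypercall_fn_t *const base_table[NR_HC] = { BASE_HYPERCALLS };

static hypercall_fn_t *const pvh_table[NR_HC] = {
    BASE_HYPERCALLS,
    /* Overrides: the later designated initializer wins. */
    [HC_grant_table_op] = hvm_grant_table_op,
    [HC_vcpu_op]        = hvm_vcpu_op,
};

/* Look up and call one entry of a table. */
static long dispatch(hypercall_fn_t *const *table, unsigned int nr)
{
    assert(nr < NR_HC && table[nr] != NULL);
    return table[nr]();
}
```

The overrides section makes the differing entries visible at a glance, which is the readability argument made in the thread. The trade-off is that compilers may warn about the repeated index (GCC's -Woverride-init, Clang's -Winitializer-overrides), so a build using this pattern would likely need to silence that warning for the file.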
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:> This patch contains vmcs changes related for PVH, mainly creating a VMCS > for PVH guest. > > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > xen/arch/x86/hvm/vmx/vmcs.c | 247 ++++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 245 insertions(+), 2 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c > index 36f167f..8d35370 100644 > --- a/xen/arch/x86/hvm/vmx/vmcs.c > +++ b/xen/arch/x86/hvm/vmx/vmcs.c > @@ -634,7 +634,7 @@ void vmx_vmcs_exit(struct vcpu *v) > { > /* Don''t confuse vmx_do_resume (for @v or @current!) */ > vmx_clear_vmcs(v); > - if ( is_hvm_vcpu(current) ) > + if ( !is_pv_vcpu(current) ) > vmx_load_vmcs(current); > > spin_unlock(&v->arch.hvm_vmx.vmcs_lock); > @@ -856,6 +856,239 @@ static void vmx_set_common_host_vmcs_fields(struct vcpu *v) > __vmwrite(HOST_SYSENTER_EIP, sysenter_eip); > } > > +static int pvh_check_requirements(struct vcpu *v) > +{ > + u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); > + > + if ( !paging_mode_hap(v->domain) ) > + { > + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); > + return -EINVAL; > + } > + if ( !cpu_has_vmx_pat ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); > + return -ENOSYS; > + } > + if ( !cpu_has_vmx_msr_bitmap ) > + { > + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); > + return -ENOSYS; > + } > + if ( !cpu_has_vmx_vpid ) > + { > + printk(XENLOG_G_INFO "PVH: CPU doesn''t have VPID support\n"); > + return -ENOSYS; > + } > + if ( !cpu_has_vmx_secondary_exec_control ) > + { > + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); > + return -ENOSYS; > + } > + > + if ( v->domain->arch.vtsc ) > + { > + printk(XENLOG_G_INFO > + "At present PVH only supports the default timer mode\n"); > + return -ENOSYS; > + } > + > + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; > + if ( 
(tmpval & required) != required ) > + { > + printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", > + required); > + return -ENOSYS; > + } > + > + return 0; > +} > + > +static int pvh_construct_vmcs(struct vcpu *v) > +{ > + int rc, msr_type; > + unsigned long *msr_bitmap; > + struct domain *d = v->domain; > + struct p2m_domain *p2m = p2m_get_hostp2m(d); > + struct ept_data *ept = &p2m->ept; > + u32 vmexit_ctl = vmx_vmexit_control; > + u32 vmentry_ctl = vmx_vmentry_control; > + u64 host_pat, tmpval = -1; > + > + if ( (rc = pvh_check_requirements(v)) ) > + return rc; > + > + msr_bitmap = alloc_xenheap_page(); > + if ( msr_bitmap == NULL ) > + return -ENOMEM; > + > + /* 1. Pin-Based Controls: */ > + __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control); > + > + v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control; > + > + /* 2. Primary Processor-based controls: */ > + /* > + * If rdtsc exiting is turned on and it goes thru emulate_privileged_op, > + * then pv_vcpu.ctrlreg must be added to the pvh struct. > + */ > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING; > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; > + > + v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING | > + CPU_BASED_CR3_LOAD_EXITING | > + CPU_BASED_CR3_STORE_EXITING); > + v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > + v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_MSR_BITMAP; > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW; > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING; > + > + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control); > + > + /* 3. 
Secondary Processor-based controls (Intel SDM: resvd bits are 0): */ > + v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT; > + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID; > + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_PAUSE_LOOP_EXITING; > + > + __vmwrite(SECONDARY_VM_EXEC_CONTROL, > + v->arch.hvm_vmx.secondary_exec_control); > + > + __vmwrite(IO_BITMAP_A, virt_to_maddr((char *)hvm_io_bitmap + 0)); > + __vmwrite(IO_BITMAP_B, virt_to_maddr((char *)hvm_io_bitmap + PAGE_SIZE)); > + > + /* MSR bitmap for intercepts. */ > + memset(msr_bitmap, ~0, PAGE_SIZE); > + v->arch.hvm_vmx.msr_bitmap = msr_bitmap; > + __vmwrite(MSR_BITMAP, virt_to_maddr(msr_bitmap)); > + > + msr_type = MSR_TYPE_R | MSR_TYPE_W; > + /* Disable interecepts for MSRs that have corresponding VMCS fields. */ > + vmx_disable_intercept_for_msr(v, MSR_FS_BASE, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_GS_BASE, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, msr_type); > + vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, msr_type); > + > + /* > + * We don''t disable intercepts for MSRs: MSR_STAR, MSR_LSTAR, MSR_CSTAR, > + * and MSR_SYSCALL_MASK because we need to specify save/restore area to > + * save/restore at every VM exit and entry. Instead, let the intercept > + * functions save them into vmx_msr_state fields. See comment in > + * vmx_restore_host_msrs(). See also vmx_restore_guest_msrs(). 
> + */ > + __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); > + __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0); > + __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); > + > + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); > + > + /* > + * Note: we run with default VM_ENTRY_LOAD_DEBUG_CTLS of 1, which means > + * upon vmentry, the cpu reads/loads VMCS.DR7 and VMCS.DEBUGCTLS, and not > + * use the host values. 0 would cause it to not use the VMCS values. > + */ > + vmentry_ctl &= ~VM_ENTRY_LOAD_GUEST_EFER; > + vmentry_ctl &= ~VM_ENTRY_SMM; > + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR; > + /* PVH 32bitfixme. */ > + vmentry_ctl |= VM_ENTRY_IA32E_MODE; /* GUEST_EFER.LME/LMA ignored */ > + > + __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); > + > + vmx_set_common_host_vmcs_fields(v); > + > + __vmwrite(VM_ENTRY_INTR_INFO, 0); > + __vmwrite(CR3_TARGET_COUNT, 0); > + __vmwrite(GUEST_ACTIVITY_STATE, 0); > + > + /* These are sorta irrelevant as we load the discriptors directly. */ > + __vmwrite(GUEST_CS_SELECTOR, 0); > + __vmwrite(GUEST_DS_SELECTOR, 0); > + __vmwrite(GUEST_SS_SELECTOR, 0); > + __vmwrite(GUEST_ES_SELECTOR, 0); > + __vmwrite(GUEST_FS_SELECTOR, 0); > + __vmwrite(GUEST_GS_SELECTOR, 0); > + > + __vmwrite(GUEST_CS_BASE, 0); > + __vmwrite(GUEST_CS_LIMIT, ~0u); > + /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. 
*/ > + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); > + > + __vmwrite(GUEST_DS_BASE, 0); > + __vmwrite(GUEST_DS_LIMIT, ~0u); > + __vmwrite(GUEST_DS_AR_BYTES, 0xc093); /* read/write, accessed */ > + > + __vmwrite(GUEST_SS_BASE, 0); > + __vmwrite(GUEST_SS_LIMIT, ~0u); > + __vmwrite(GUEST_SS_AR_BYTES, 0xc093); /* read/write, accessed */ > + > + __vmwrite(GUEST_ES_BASE, 0); > + __vmwrite(GUEST_ES_LIMIT, ~0u); > + __vmwrite(GUEST_ES_AR_BYTES, 0xc093); /* read/write, accessed */ > + > + __vmwrite(GUEST_FS_BASE, 0); > + __vmwrite(GUEST_FS_LIMIT, ~0u); > + __vmwrite(GUEST_FS_AR_BYTES, 0xc093); /* read/write, accessed */ > + > + __vmwrite(GUEST_GS_BASE, 0); > + __vmwrite(GUEST_GS_LIMIT, ~0u); > + __vmwrite(GUEST_GS_AR_BYTES, 0xc093); /* read/write, accessed */ > + > + __vmwrite(GUEST_GDTR_BASE, 0); > + __vmwrite(GUEST_GDTR_LIMIT, 0); > + > + __vmwrite(GUEST_LDTR_BASE, 0); > + __vmwrite(GUEST_LDTR_LIMIT, 0); > + __vmwrite(GUEST_LDTR_AR_BYTES, 0x82); /* LDT */ > + __vmwrite(GUEST_LDTR_SELECTOR, 0); > + > + /* Guest TSS. */ > + __vmwrite(GUEST_TR_BASE, 0); > + __vmwrite(GUEST_TR_LIMIT, 0xff); > + __vmwrite(GUEST_TR_AR_BYTES, 0x8b); /* 32-bit TSS (busy) */ > + > + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); > + __vmwrite(GUEST_DR7, 0); > + __vmwrite(VMCS_LINK_POINTER, ~0UL); > + > + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); > + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0); > + > + v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK | (1U << TRAP_debug) | > + (1U << TRAP_int3) | (1U << TRAP_no_device); > + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap); > + > + /* Set WP bit so rdonly pages are not written from CPL 0. 
*/ > + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP; > + __vmwrite(GUEST_CR0, tmpval); > + __vmwrite(CR0_READ_SHADOW, tmpval); > + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = tmpval; > + > + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); > + __vmwrite(GUEST_CR4, tmpval); > + __vmwrite(CR4_READ_SHADOW, tmpval); > + v->arch.hvm_vcpu.guest_cr[4] = tmpval; > + > + __vmwrite(CR0_GUEST_HOST_MASK, ~0UL); > + __vmwrite(CR4_GUEST_HOST_MASK, ~0UL); > + > + v->arch.hvm_vmx.vmx_realmode = 0; > + > + ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m)); > + __vmwrite(EPT_POINTER, ept_get_eptp(ept)); > + > + rdmsrl(MSR_IA32_CR_PAT, host_pat); > + __vmwrite(HOST_PAT, host_pat); > + __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET); > + > + /* The paging mode is updated for PVH by arch_set_info_guest(). */ > + > + return 0; > +}

The majority of this function seems to be duplicating code in construct_vmcs(), but in a different order so that it's very difficult to tell which is which. Wouldn't it be better to just sprinkle if(is_pvh_domain()) around construct_vmcs()?

Duplicating the code like this not only makes it more difficult to read, masking potential mistakes in making the new function, but it also means that two places will need to be modified to take advantage of any new features.

-George
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
> +    /* 3. Secondary Processor-based controls (Intel SDM: resvd bits are 0): */
> +    v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT;
> +    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID;

In construct_vmcs(), it says, "Disable VPID for now; we decide when to enable it on VMENTER". Is this different somehow for PVH guests compared to HVM guests?

-George
George Dunlap
2013-Aug-09 13:44 UTC
Re: [V10 PATCH 21/23] PVH xen: VMX support of PVH guest creation/destruction
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch implements the vmx portion of the guest create, ie
> vcpu and domain initialization. Some changes to support the destroy path.
>
> Change in V10:
>   - Don't call vmx_domain_initialise / vmx_domain_destroy for PVH.
>   - Do not set hvm_vcpu.guest_efer here in vmx.c.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c |   28 ++++++++++++++++++++++++++++
>  1 files changed, 28 insertions(+), 0 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 80109c1..f6ea39a 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1076,6 +1076,28 @@ static void vmx_update_host_cr3(struct vcpu *v)
>      vmx_vmcs_exit(v);
>  }
>
> +/*
> + * PVH guest never causes CR3 write vmexit. This is called during the guest
> + * setup.
> + */
> +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
> +{
> +    vmx_vmcs_enter(v);
> +    switch ( cr )
> +    {
> +    case 3:
> +        __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
> +        hvm_asid_flush_vcpu(v);
> +        break;
> +
> +    default:
> +        printk(XENLOG_ERR
> +               "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
> +               v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));
> +    }
> +    vmx_vmcs_exit(v);
> +}

This function seems almost completely pointless. In the case of CR3, it basically does exactly what the function below does. It avoids maybe doing something pointless, like vmx_load_ptrs(), but that should be harmless, right?

This patch could be taken out entirely, or replaced with a simple

    if(is_pvh_vcpu(v)) ASSERT(cr == 3);

-George
George Dunlap
2013-Aug-09 14:14 UTC
Re: [V10 PATCH 22/23] PVH xen: preparatory patch for the pvh vmexit handler patch
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This is a preparatory patch for the next pvh vmexit handler patch.

I don't see any reason to have this as a separate patch -- it just makes a couple of functions public and adds a useless function call.

-George
Konrad Rzeszutek Wilk
2013-Aug-09 14:44 UTC
Re: [V10 PATCH 22/23] PVH xen: preparatory patch for the pvh vmexit handler patch
On Fri, Aug 09, 2013 at 03:14:47PM +0100, George Dunlap wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > This is a preparatory patch for the next pvh vmexit handler patch.
>
> I don't see any reason to have this as a separate patch -- it just
> makes a couple of functions public and adds a useless function call.

It's scaffolding code. It should make it easier for the patches that follow this to not have extra code changes.

Aka, a patch with no code changes, just moving stuff around.
George Dunlap
2013-Aug-09 14:47 UTC
Re: [V10 PATCH 22/23] PVH xen: preparatory patch for the pvh vmexit handler patch
On 09/08/13 15:44, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 09, 2013 at 03:14:47PM +0100, George Dunlap wrote:
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>>> This is a preparatory patch for the next pvh vmexit handler patch.
>> I don't see any reason to have this as a separate patch -- it just
>> makes a couple of functions public and adds a useless function call.
> Its scaffolding code. It should make it easier for the patches that
> follow this to not have extra code changes.
>
> Aka, no code changes patch, but just moving stuff around.

That is true of the earlier "prep" patches in the series, but not this one. Unlike those, this doesn't move any code around; all it does, as I said, is make some functions public and add a call to an empty function. There's no reason not to merge this into patch #23.

-George
Konrad Rzeszutek Wilk
2013-Aug-09 18:10 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
> > That sounds like a good change to me. Jan, you OK with this?
>
> It's worse performance wise, but better maintenance wise, so I
> guess I don't really object (but also am not too happy with it).

Jan, this is confusing. We are looking to you and Keir for guidance, as both of you are the maintainers and have the final say in the x86 architecture.

Which way would you like it?
Keir Fraser
2013-Aug-09 21:15 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On 09/08/2013 19:10, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>>> That sounds like a good change to me. Jan, you OK with this?
>>
>> It's worse performance wise, but better maintenance wise, so I
>> guess I don't really object (but also am not too happy with it).
>
> Jan this is confusing. We are looking for you and Keir as guidance
> as both of you are the maintainers and have the final say in the x86
> architecture.
>
> Which way would you like it?

The way Mukesh already has it is fine. I see no problem with passing in both the selector value and the enum. Both are readily producible by all callers it seems. Yes it's a bit redundant but seems not too ugly to me.

It's really a very minor point, and I'd Ack this either way.

 -- Keir
Mukesh Rathor
2013-Aug-09 23:41 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Thu, 08 Aug 2013 07:56:41 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Mon, 05 Aug 2013 12:10:15 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> >> > +{
> >> > +    if ( v->vcpu_id == 0 )
> >> > +        return 0;
> >> > +
> >> > +    if ( !(ctxtp->flags & VGCF_in_kernel) )
> >> > +        return -EINVAL;
> >> > +
> >> > +    vmx_vmcs_enter(v);
> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> >> > +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> >> > +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
> >>
> >> Just noticed: Aren't you mixing up entries and bytes here?
> >
> > Right:
> >
> > __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );
> >
> > Any formatting issues here? I don't see in coding style, and see
> > both code where there is a space around '*' and not.
>
> The inner parentheses are superfluous.
>
> CODING_STYLE is pretty explicit about there needing to be white
> space around operators: "Spaces are placed [...], and around
> binary operators (except the structure access operators, '.' and
> '->')."
>
> > Also, when setting the limit, do we need to worry about the G flag?
> > or for that matter, D/B whether segment is growing up or down?
> > It appears we don't need to worry about that for LDT, but not sure
> > reading the SDMs..
>
> The D/B bit doesn't matter for LDT (and TSS), but the G bit would.

Ugh, to find the G bit, I need to walk the GDT to find the LDT descriptor. Walking the GDT to look for a system descriptor means mapping guest gdt pages as I go thru the table, and also the system descriptor sizes are different for 32bit vs IA-32e modes, adding extra code... All that just doesn't seem worth it to me for supporting LDT during vcpu bringup. Keir, do you have any thoughts?

Basically, I'm trying to support VCPUOP_initialise here, which is used by a PV guest boot vcpu to set context of another vcpu it's trying to bring up. In retrospect, I should have just created VCPUOP_initialise_pvh with limited fields needed for PVH. (We already ignore a bunch of stuff for PVH from VCPUOP_initialise like trap_ctxt, event_callback*, syscall_callback*, etc...). But anyways, can't we just document for VCPUOP_initialise that only the following fields are relevant and honored for PVH:

    gdt.pvh.addr/limit, and ctxtp->user_regs.cs/ds/ss
    (And others used in arch_set_info_guest like user_regs, flags,...)

Since we are loading gdtr and selectors cs/ds/ss, we should also load the hidden fields for cs/ds/ss. That IMO is plenty enough support for the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs base, etc first thing in its HVM container. What do you all think?

Jan, sorry for going in circles on this.

thanks
Mukesh
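The entries-versus-bytes mix-up and the G-bit question above come down to two small pieces of arithmetic, sketched here. The helper names are hypothetical, and the assert on a non-zero entry count is an assumption of this sketch; the thread does not say what limit an empty LDT should get.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_SIZE 8u /* bytes per descriptor-table entry, as in "ldt_ents * 8 - 1" */

/*
 * Byte limit of a table holding `ents` entries: the offset of its last
 * valid byte. Assumes at least one entry.
 */
static uint32_t table_limit_from_entries(uint32_t ents)
{
    assert(ents > 0);
    return ents * DESC_SIZE - 1;
}

/*
 * Effective byte limit described by a descriptor's 20-bit limit field:
 * with G clear the field counts bytes, with G set it counts 4 KiB units
 * and the low 12 bits of the effective limit are all ones.
 */
static uint32_t effective_limit(uint32_t raw_limit, int g)
{
    raw_limit &= 0xfffffu;
    return g ? (raw_limit << 12) | 0xfffu : raw_limit;
}
```

For example, a full 8192-entry table gives a limit of 0xffff, which still fits in a byte-granular 20-bit field. That is consistent with the thread's concern that the G bit only matters if the hypervisor has to mirror whatever the guest's own descriptor says.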
On Fri, 9 Aug 2013 14:39:19 +0100
George Dunlap <George.Dunlap@eu.citrix.com> wrote:

> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> >
> > +    /* 3. Secondary Processor-based controls (Intel SDM: resvd bits are 0): */
> > +    v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT;
> > +    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID;
>
> In construct_vmcs(), it says, "Disable VPID for now; we decide when to
> enable it on VMENTER". Is this different somehow for PVH guests
> compared to HVM guests?
>
> -George

We just require VPID support, (we can remove it in future):

    if ( !cpu_has_vmx_secondary_exec_control )
    {
        printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n");
        return -ENOSYS;
    }

-Mukesh
On Fri, 9 Aug 2013 11:25:36 +0100
George Dunlap <dunlapg@umich.edu> wrote:

> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > This patch contains vmcs changes related for PVH, mainly creating a
> > VMCS for PVH guest.
.....
> > +    v->arch.hvm_vmx.vmx_realmode = 0;
> > +
> > +    ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m));
> > +    __vmwrite(EPT_POINTER, ept_get_eptp(ept));
> > +
> > +    rdmsrl(MSR_IA32_CR_PAT, host_pat);
> > +    __vmwrite(HOST_PAT, host_pat);
> > +    __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET);
> > +
> > +    /* The paging mode is updated for PVH by arch_set_info_guest(). */
> > +
> > +    return 0;
> > +}
>
> The majority of this function seems to be duplicating code in
> construct_vmcs(), but in a different order so that it's very difficult
> to tell which is which. Wouldn't it be better to just sprinkle
> if(is_pvh_domain()) around construct_vmcs()?

Nah, just makes the function extremely messy! Other maintainers I consulted with were OK with making it a separate function. The function is mostly ordered by vmx sections in the Intel SDM.

mukesh
Mukesh Rathor
2013-Aug-10 01:59 UTC
Re: [V10 PATCH 21/23] PVH xen: VMX support of PVH guest creation/destruction
On Fri, 9 Aug 2013 14:44:48 +0100
George Dunlap <dunlapg@umich.edu> wrote:

> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > This patch implements the vmx portion of the guest create, ie
> > vcpu and domain initialization. Some changes to support the destroy
> > path.
> >
> > Change in V10:
> >   - Don't call vmx_domain_initialise / vmx_domain_destroy for PVH.
> >   - Do not set hvm_vcpu.guest_efer here in vmx.c.
> >
> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> > ---
> >  xen/arch/x86/hvm/vmx/vmx.c |   28 ++++++++++++++++++++++++++++
> >  1 files changed, 28 insertions(+), 0 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 80109c1..f6ea39a 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1076,6 +1076,28 @@ static void vmx_update_host_cr3(struct vcpu *v)
> >      vmx_vmcs_exit(v);
> >  }
> >
> > +/*
> > + * PVH guest never causes CR3 write vmexit. This is called during
> > + * the guest setup.
> > + */
> > +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
> > +{
> > +    vmx_vmcs_enter(v);
> > +    switch ( cr )
> > +    {
> > +    case 3:
> > +        __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
> > +        hvm_asid_flush_vcpu(v);
> > +        break;
> > +
> > +    default:
> > +        printk(XENLOG_ERR
> > +               "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
> > +               v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));
> > +    }
> > +    vmx_vmcs_exit(v);
> > +}
>
> This function seems almost completely pointless. In the case of CR3,
> it basically does exactly what the function below does. It avoids
> maybe doing something pointless, like vmx_load_ptrs(), but that should
> be harmless, right?

Harmless, but nonetheless pointless to pay a small penalty calling it. Such things are subjective and a matter of personal opinion, and vary from maintainer to maintainer.... I don't care either way. If it helps this patch end its misery, I'll make the change!

Mukesh
Mukesh Rathor
2013-Aug-10 02:13 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On Thu, 8 Aug 2013 09:55:26 +0100
George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 08/08/13 02:49, Mukesh Rathor wrote:
> > On Wed, 7 Aug 2013 12:29:13 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
> >>> This patch supports invalid op emulation for PVH by calling
> >>> appropriate copy macros and HVM function to inject PF.
> >>>
> >>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> >>> ---
> >>>  xen/arch/x86/traps.c        |   17 ++++++++++++++---
> >>>  xen/include/asm-x86/traps.h |    1 +
> >> Why make this non-static? No one is using this in this patch. If
> >> a later patch needs it, you should make it non-static there, so we
> >> can decide at that point if making it non-static is merited or not.
> > Sigh! Originally, it was that way, but then to keep that patch from
> > getting too big, it got moved here after few versions. We are making
> > emulation available for outside the PV, ie, to PVH.
>
> As far as I'm concerned, the size of the patch itself is immaterial;
> the only important question, regarding how to break down patches
> (just like in breaking down functions), is how easy or difficult it
> is to understand the whole thing.
>
> Now it's typically the case that long patches are hard to understand,
> and that breaking them down into smaller chunks makes them easier to
> read. But a division like this, where you've moved some random hunk
> into a different patch with which it has no logical relation, makes
> the series *harder* to understand, not easier.
>
> Additionally, as the series evolves, it makes it difficult to keep
> all of the dependencies straight. Suppose you changed your approach
> for that future patch so that you didn't need this public anymore.
> You, and all the reviewers, could easily forget about the dependency,
> since it's in a separate patch which may have already been classified
> as "OK".

But that would happen even if the function was static. Say I make changes in function for PVH, don't make it public. Now I forget to use it, the function has been changed already?

We make it public to be used by a future patch; I'll add which patch is using it to make it easier to understand. Not making it public makes possible another comment -- why the change if it can't be used by another PVH module anyways. Can't please all reviewers simultaneously!!! All my life so far, all reviews are done by one person per file, and that makes so much sense..... this is hell!!!

Mukesh
Jan Beulich
2013-Aug-12 07:38 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
>>> On 10.08.13 at 04:13, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 8 Aug 2013 09:55:26 +0100
> George Dunlap <george.dunlap@eu.citrix.com> wrote:
>
>> On 08/08/13 02:49, Mukesh Rathor wrote:
>> > On Wed, 7 Aug 2013 12:29:13 +0100
>> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> >
>> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> >> <mukesh.rathor@oracle.com> wrote:
>> >>> This patch supports invalid op emulation for PVH by calling
>> >>> appropriate copy macros and HVM function to inject PF.
>> >>>
>> >>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> >>> ---
>> >>>  xen/arch/x86/traps.c        |   17 ++++++++++++++---
>> >>>  xen/include/asm-x86/traps.h |    1 +
>> >> Why make this non-static? No one is using this in this patch. If
>> >> a later patch needs it, you should make it non-static there, so we
>> >> can decide at that point if making it non-static is merited or not.
>> > Sigh! Originally, it was that way, but then to keep that patch from
>> > getting too big, it got moved here after few versions. We are making
>> > emulation available for outside the PV, ie, to PVH.
>>
>> As far as I'm concerned, the size of the patch itself is immaterial;
>> the only important question, regarding how to break down patches
>> (just like in breaking down functions), is how easy or difficult it
>> is to understand the whole thing.
>>
>> Now it's typically the case that long patches are hard to understand,
>> and that breaking them down into smaller chunks makes them easier to
>> read. But a division like this, where you've moved some random hunk
>> into a different patch with which it has no logical relation, makes
>> the series *harder* to understand, not easier.
>>
>> Additionally, as the series evolves, it makes it difficult to keep
>> all of the dependencies straight. Suppose you changed your approach
>> for that future patch so that you didn't need this public anymore.
>> You, and all the reviewers, could easily forget about the dependency,
>> since it's in a separate patch which may have already been classified
>> as "OK".
>
> But that would happen even if the function was static. Say I make
> changes in function for PVH, don't make it public. Now I forget to
> use it, the function has been changed already?
>
> We make it public to be used by future patch, I'll add which patch is
> using it to make it easier to understand. Not making it public makes
> possible another comment -- why the change if it can't be used by
> another PVH module anyways. Can't please all reviewers
> simultaneously!!! All my life so far, all reviews are done by one
> person per file, and that makes so much sense..... this is hell!!!

Yes, I see how this is not pleasant for you. But that's the way it is with community projects. And I've seen numerous occasions where there were multiple reviewers, and one has to find a way to make all of them happy.

For what it's worth - I had pointed out the non-logical breakup of the series as an issue quite early in the process, and merely gave in realizing that your life with this is already difficult enough.

Jan
>>> On 10.08.13 at 02:23, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 9 Aug 2013 11:25:36 +0100
> George Dunlap <dunlapg@umich.edu> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>> > This patch contains vmcs changes related for PVH, mainly creating a
>> > VMCS for PVH guest.
> .....
>> > +    v->arch.hvm_vmx.vmx_realmode = 0;
>> > +
>> > +    ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m));
>> > +    __vmwrite(EPT_POINTER, ept_get_eptp(ept));
>> > +
>> > +    rdmsrl(MSR_IA32_CR_PAT, host_pat);
>> > +    __vmwrite(HOST_PAT, host_pat);
>> > +    __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET);
>> > +
>> > +    /* The paging mode is updated for PVH by arch_set_info_guest(). */
>> > +
>> > +    return 0;
>> > +}
>>
>> The majority of this function seems to be duplicating code in
>> construct_vmcs(), but in a different order so that it's very difficult
>> to tell which is which. Wouldn't it be better to just sprinkle
>> if(is_pvh_domain()) around construct_vmcs?
>
>
> Nah, just makes the function extremely messy! Other maintainers I
> consulted with were OK with making it a separate function. The function
> is mostly ordered by vmx sections in the intel SDM.

But I'm sure you also appreciate the point George makes: The less
code duplication, the easier maintenance will end up being. So it
very much depends on the number of if()-s needed in the earlier
function.

And if we're to go with two instances, then if you re-ordered things
from the original for a good reason, including a patch to also
re-order the original function (so they become more similar again)
would be appreciated. Plus (if you didn't already) include a comment
in both functions referring to the other (so people updating one have
a fair chance of noticing that the other may need updating too).

Jan
Jan Beulich
2013-Aug-12 07:54 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 08 Aug 2013 07:56:41 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Mon, 05 Aug 2013 12:10:15 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >
>> >> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> >>> wrote:
>> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
>> >> > +{
>> >> > +    if ( v->vcpu_id == 0 )
>> >> > +        return 0;
>> >> > +
>> >> > +    if ( !(ctxtp->flags & VGCF_in_kernel) )
>> >> > +        return -EINVAL;
>> >> > +
>> >> > +    vmx_vmcs_enter(v);
>> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
>> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
>> >> > +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
>> >> > +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
>> >>
>> >> Just noticed: Aren't you mixing up entries and bytes here?
>> >
>> > Right:
>> >
>> >     __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );
>> >
>> > Any formatting issues here? I don't see in coding style, and see
>> > both code where there is a space around '*' and not.
>>
>> The inner parentheses are superfluous.
>>
>> CODING_STYLE is pretty explicit about there needing to be white
>> space around operators: "Spaces are placed [...], and around
>> binary operators (except the structure access operators, '.' and
>> '->')."
>>
>> > Also, when setting the limit, do we need to worry about the G flag?
>> > or for that matter, D/B whether segment is growing up or down?
>> > It appears we don't need to worry about that for LDT, but not sure
>> > reading the SDMs..
>>
>> The D/B bit doesn't matter for LDT (and TSS), but the G bit would.
>
> Ugh, to find the G bit, I need to walk the GDT to find the LDT descriptor.
> Walking the GDT to look for system descriptor means mapping guest gdt
> pages as I go thru the table, and also the system descriptor sizes are
> different for 32bit vs IA-32e modes adding extra code... All that just
> doesn't seem worth it to me for supporting LDT during vcpu bringup.

Which is why I suggested requiring the LDT to be empty.

> Keir, do you have any thoughts? Basically, I'm trying to support
> VCPUOP_initialise here, which is used by a PV guest boot vcpu to
> set context of another vcpu it's trying to bring up. In retrospect, I
> should have just created VCPUOP_initialise_pvh with limited fields
> needed for PVH. (We already ignore bunch of stuff for PVH from
> VCPUOP_initialise like trap_ctxt, event_callback*, syscall_callback*,
> etc...). But anyways, can't we just document VCPUOP_initialise that
> only following fields are relevant and honored for PVH:
>
>    gdt.pvh.addr/limit, and ctxtp->user_regs.cs/ds/ss
>
> (And others used in arch_set_info_guest like user_regs, flags,...)
>
> Since we are loading gdtr and selectors cs/ds/ss, we should also load
> the hidden fields for cs/ds/ss. That IMO is plenty enough support for
> the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs
> base, etc first thing in its HVM container. What do you all think?

If you implement loading the hidden fields of CS, then doing the
same for the LDT shouldn't be that much more code (and if you
permit a non-null LDT selector, then having it in place would even
be a requirement before validating the CS selector). But again,
I had already indicated that I'd be fine with requiring the state to
be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
and GS holding null selectors. And CS descriptor validation done
only in debug mode.

Talking of the LDT selector: IIRC you modify struct
vcpu_guest_context's GDT to match PVH needs, but if I'm not
mistaken you don't do the same for the LDT - PVH would require
merely a selector here, not a base/ents pair.

Jan
Tim Deegan
2013-Aug-12 09:00 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 16:41 -0700 on 09 Aug (1376066498), Mukesh Rathor wrote:
> On Thu, 08 Aug 2013 07:56:41 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
> > >>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com>
> > >>> wrote:
> > > On Mon, 05 Aug 2013 12:10:15 +0100
> > > "Jan Beulich" <JBeulich@suse.com> wrote:
> > >
> > >> >>> On 24.07.13 at 03:59, Mukesh Rathor <mukesh.rathor@oracle.com>
> > >> >>> wrote:
> > >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> > >> > +{
> > >> > +    if ( v->vcpu_id == 0 )
> > >> > +        return 0;
> > >> > +
> > >> > +    if ( !(ctxtp->flags & VGCF_in_kernel) )
> > >> > +        return -EINVAL;
> > >> > +
> > >> > +    vmx_vmcs_enter(v);
> > >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> > >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> > >> > +    __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
> > >> > +    __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents);
> > >>
> > >> Just noticed: Aren't you mixing up entries and bytes here?
> > >
> > > Right:
> > >
> > >     __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) );
> > >
> > > Any formatting issues here? I don't see in coding style, and see
> > > both code where there is a space around '*' and not.
> >
> > The inner parentheses are superfluous.
> >
> > CODING_STYLE is pretty explicit about there needing to be white
> > space around operators: "Spaces are placed [...], and around
> > binary operators (except the structure access operators, '.' and
> > '->')."
> >
> > > Also, when setting the limit, do we need to worry about the G flag?
> > > or for that matter, D/B whether segment is growing up or down?
> > > It appears we don't need to worry about that for LDT, but not sure
> > > reading the SDMs..
> >
> > The D/B bit doesn't matter for LDT (and TSS), but the G bit would.
>
> Ugh, to find the G bit, I need to walk the GDT to find the LDT descriptor.

Why so? The caller supplies you with the LDT base and range, not a
segment selector. I don't think you could find the right LDT selector
by scanning the GDT anyway -- what if there were two that matched?

Tim.
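
The entries-versus-bytes mix-up discussed in this sub-thread comes down to a small conversion. The following is a minimal sketch, assuming 8-byte descriptor slots as in the `ldt_ents * 8 - 1` expression quoted above. The name `demo_ldt_ents_to_limit` and the 8192-entry cap are illustrative assumptions, not code from the series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DESC_BYTES   8u     /* size of one descriptor slot */
#define DEMO_LDT_MAX_ENTS 8192u  /* assumed cap: 13-bit selector index */

/*
 * Convert an LDT entry count into a byte limit, i.e. the offset of the
 * last valid byte.  Returns 0 on success, or -1 if the count is zero
 * (an empty LDT has no meaningful limit) or above the assumed cap.
 */
static int demo_ldt_ents_to_limit(uint32_t ents, uint32_t *limit)
{
    assert(limit != NULL);

    if ( ents == 0 || ents > DEMO_LDT_MAX_ENTS )
        return -1;

    *limit = ents * DEMO_DESC_BYTES - 1;
    return 0;
}
```

Under these assumptions the largest result is 0xffff, so the value always fits a byte-granular limit; an empty LDT is reported as an error so that the caller has to handle it explicitly.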
George Dunlap
2013-Aug-12 09:35 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On 10/08/13 03:13, Mukesh Rathor wrote:
> On Thu, 8 Aug 2013 09:55:26 +0100
> George Dunlap <george.dunlap@eu.citrix.com> wrote:
>
>> On 08/08/13 02:49, Mukesh Rathor wrote:
>>> On Wed, 7 Aug 2013 12:29:13 +0100
>>> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>>
>>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>>>> <mukesh.rathor@oracle.com> wrote:
>>>>> This patch supports invalid op emulation for PVH by calling
>>>>> appropriate copy macros and an HVM function to inject PF.
>>>>>
>>>>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>>  xen/arch/x86/traps.c        | 17 ++++++++++++++---
>>>>>  xen/include/asm-x86/traps.h |  1 +
>>>> Why make this non-static? No one is using this in this patch. If
>>>> a later patch needs it, you should make it non-static there, so we
>>>> can decide at that point if making it non-static is merited or not.
>>> Sigh! Originally, it was that way, but then to keep that patch from
>>> getting too big, it got moved here after few versions. We are making
>>> emulation available for outside the PV, ie, to PVH.
>> As far as I'm concerned, the size of the patch itself is immaterial;
>> the only important question, regarding how to break down patches
>> (just like in breaking down functions), is how easy or difficult it
>> is to understand the whole thing.
>>
>> Now it's typically the case that long patches are hard to understand,
>> and that breaking them down into smaller chunks makes them easier to
>> read. But a division like this, where you've moved some random hunk
>> into a different patch with which it has no logical relation, makes
>> the series *harder* to understand, not easier.
>>
>> Additionally, as the series evolves, it makes it difficult to keep
>> all of the dependencies straight. Suppose you changed your approach
>> for that future patch so that you didn't need this public anymore.
>> You, and all the reviewers, could easily forget about the dependency,
>> since it's in a separate patch which may have already been classified
>> as "OK".
> But that would happen even if the function was static. Say I make
> changes in function for PVH, don't make it public. Now I forget to
> use it, the function has been changed already?

I said "as the patch series evolves" -- that means before it gets
applied. If the hunk making it public is in the same patch that it's
used, it is much more likely to be noticed.

> We make it public to be used by future patch, I'll add which patch is
> using it to make it easier to understand.

Why don't you just *move it* to the patch that's actually using it
(the last patch in the series, it would seem -- the one which
implements the PVH exit handler)? In the time it took you to write
this e-mail, you could have moved that one hunk to the other patch 5
times.

> Not making it public makes
> possible another comment -- why the change if it can't be used by
> another PVH module anyways. Can't please all reviewers
> simultaneously!!!
> All my life so far, all reviews are done by one
> person per file, and that makes so much sense..... this is hell!!!

Did someone actually say to you, "This patch is too long, make it
shorter"?

I'm not asking for the moon, and I'm not trying to grind your gears.
I'm just trying to help you write better patches -- ones which are
easier to review, and ones which will be easier for people looking
back at them to figure out what's going on. And more importantly, in
comments in the rest of the series, I'm trying to make code which is
easier to understand.

This is important enough to me that I'd actually be willing to take
your series and reformat it myself, to demonstrate what I'm talking
about. It would be a good exercise for me to become familiar with the
code; the only problem is that I'm not up to speed with the details of
the hardware at this point.

 -George
George Dunlap
2013-Aug-12 09:43 UTC
Re: [V10 PATCH 12/23] PVH xen: Support privileged op emulation for PVH
On 09/08/13 07:54, Jan Beulich wrote:
>>>> On 09.08.13 at 03:32, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> On Thu, 8 Aug 2013 15:18:56 +0100
>> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> The problem I have is that you still pass in *both* the value of
>>> regs->$SEGMENT_REGISTER, *and* an enum of a segment register, and use
>>> one in one case, and another in a different case. That's just a
>>> really ugly interface.
>>>
>>> What I'd like to see is for read_descriptor_sel() to *just* take
>>> which_sel (perhaps renamed sreg or something, since it's referring to
>>> a segment register), and in the PV case, read the appropriate segment
>>> register, then calling read_descriptor(). Then you don't have this
>>> crazy thing where you set two variables (sel and which_cs) all over
>>> the place.
>>
>> Hmm... lemme make sure I understand precisely, what you mean is
>> something like:
>>
>> static int read_descriptor_sel(enum x86_segment which_sel,
>>                                struct vcpu *v,
>>                                const struct cpu_user_regs *regs,
>>                                unsigned long *base,
>>                                unsigned long *limit,
>>                                unsigned int *ar,
>>                                unsigned int vm86attr)
>>
>> {
>>     uint sel;
>>     if (!pvh)
>>     {
>>         sel = read_pv_segreg(which_sel)
>>         return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
>>     }
>> }
>>
>> where read_pv_segreg() has one long case statement:
>>     case x86_seg_cs
>>         return read_segment_register(v, regs, cs);
>>     case x86_seg_ds
>>         return read_segment_register(v, regs, ds);
>>     .....
>>
>>
>> Then emulate_privileged_op() will not be setting data_sel, but
>> only which_sel, except for one place:
>>
>>     ....
>>     if ( lm_ovr == lm_seg_none || data_sel < 4 )
>>     {
>>         switch ( lm_ovr )
>>         {
>>         case lm_seg_none:
>>     ...
>>
>> That sounds like a good change to me. Jan, you OK with this?
> It's worse performance wise, but better maintenance wise, so I
> guess I don't really object (but also am not too happy with it).

Is this a really hot path? It does mean going through a bit of extra
code in the simple version. In theory one could do something with
arrays to avoid that.

In any case, I think the interface in the patch is really ugly, but
I'll leave it up to Keir and Jan what they want to do.

 -George
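
The "something with arrays" idea mentioned above can be illustrated in isolation. This is a minimal sketch under stated assumptions: `enum demo_segment`, `struct demo_sregs` and `demo_read_segreg` are hypothetical stand-ins invented for the example, and the real code reads selectors from `struct cpu_user_regs` or from hardware rather than from a flat array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the emulator's segment enum; the order is illustrative. */
enum demo_segment {
    demo_seg_cs, demo_seg_ss, demo_seg_ds,
    demo_seg_es, demo_seg_fs, demo_seg_gs,
    demo_seg_count
};

/* Hypothetical snapshot of the guest's selector values, indexed by segment. */
struct demo_sregs {
    uint16_t sel[demo_seg_count];
};

/*
 * Look up one selector by segment register.  The caller passes only the
 * enum, so there is a single source of truth for "which segment".
 * Returns 0 on success, -1 for an out-of-range segment.
 */
static int demo_read_segreg(const struct demo_sregs *regs,
                            enum demo_segment seg, uint16_t *sel)
{
    assert(regs != NULL && sel != NULL);

    if ( (unsigned int)seg >= demo_seg_count )
        return -1;

    *sel = regs->sel[seg];
    return 0;
}
```

An indexed lookup replaces the long case statement with a bounds check, which is the trade-off being weighed in the thread: a slightly different data layout in exchange for a simpler interface.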
On Sat, Aug 10, 2013 at 1:23 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 9 Aug 2013 11:25:36 +0100
> George Dunlap <dunlapg@umich.edu> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>> > This patch contains vmcs changes related for PVH, mainly creating a
>> > VMCS for PVH guest.
> .....
>> > +    v->arch.hvm_vmx.vmx_realmode = 0;
>> > +
>> > +    ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m));
>> > +    __vmwrite(EPT_POINTER, ept_get_eptp(ept));
>> > +
>> > +    rdmsrl(MSR_IA32_CR_PAT, host_pat);
>> > +    __vmwrite(HOST_PAT, host_pat);
>> > +    __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET);
>> > +
>> > +    /* The paging mode is updated for PVH by arch_set_info_guest(). */
>> > +
>> > +    return 0;
>> > +}
>>
>> The majority of this function seems to be duplicating code in
>> construct_vmcs(), but in a different order so that it's very difficult
>> to tell which is which. Wouldn't it be better to just sprinkle
>> if(is_pvh_domain()) around construct_vmcs?
>
>
> Nah, just makes the function extremely messy! Other maintainers I
> consulted with were OK with making it a separate function. The function
> is mostly ordered by vmx sections in the intel SDM.

Does it? Messier than the domain building functions where we also do
a lot of if(is_pvh_domain())'s?

From my analysis, most of the differences are because the HVM code
allows two ways of doing things (e.g., either HAP or shadow) while you
only allow one. These include:
 - disabling TSC exiting (if in the correct mode, the HVM code would
   also do this)
 - Disabling invlpg and cr3 load/store exiting (same for HVM in HAP mode)
 - Unconditionally enabling secondary controls (HVM will enable if present)
 - Enabling the MSR bitmap (HVM will enable if present)
 - Updating cpu_based_exec_control directly (HVM has a function that
   switches between this and something required for nested virt)
 - Unconditionally enabling VPID (HVM will enable it somewhere else if
   appropriate)
 - &c &c

As far as I can tell, *substantial* differences mostly have to do with
starting in 64-bit mode, and not supporting debug registers at the
moment; and those should easily be able to be put in if() statements.

 -George
On Mon, Aug 12, 2013 at 11:15 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Sat, Aug 10, 2013 at 1:23 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> On Fri, 9 Aug 2013 11:25:36 +0100
>> George Dunlap <dunlapg@umich.edu> wrote:
>>
>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>>> <mukesh.rathor@oracle.com> wrote:
>>> > This patch contains vmcs changes related for PVH, mainly creating a
>>> > VMCS for PVH guest.
>> .....
>>> > +    v->arch.hvm_vmx.vmx_realmode = 0;
>>> > +
>>> > +    ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m));
>>> > +    __vmwrite(EPT_POINTER, ept_get_eptp(ept));
>>> > +
>>> > +    rdmsrl(MSR_IA32_CR_PAT, host_pat);
>>> > +    __vmwrite(HOST_PAT, host_pat);
>>> > +    __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET);
>>> > +
>>> > +    /* The paging mode is updated for PVH by arch_set_info_guest(). */
>>> > +
>>> > +    return 0;
>>> > +}
>>>
>>> The majority of this function seems to be duplicating code in
>>> construct_vmcs(), but in a different order so that it's very difficult
>>> to tell which is which. Wouldn't it be better to just sprinkle
>>> if(is_pvh_domain()) around construct_vmcs?
>>
>>
>> Nah, just makes the function extremely messy! Other maintainers I
>> consulted with were OK with making it a separate function. The function
>> is mostly ordered by vmx sections in the intel SDM.
>
> Does it? Messier than the domain building functions where we also do
> a lot of if(is_pvh_domain())'s?
>
> From my analysis, most of the differences are because the HVM code
> allows two ways of doing things (e.g., either HAP or shadow) while you
> only allow one. These include:
>  - disabling TSC exiting (if in the correct mode, the HVM code would
>    also do this)
>  - Disabling invlpg and cr3 load/store exiting (same for HVM in HAP mode)
>  - Unconditionally enabling secondary controls (HVM will enable if present)
>  - Enabling the MSR bitmap (HVM will enable if present)
>  - Updating cpu_based_exec_control directly (HVM has a function that
>    switches between this and something required for nested virt)
>  - Unconditionally enabling VPID (HVM will enable it somewhere else if
>    appropriate)
>  - &c &c

I should have also said, since you would be calling
pvh_check_requirements() at the beginning of the shared function
anyway, all of these are guaranteed to set things up as required by
PVH.

 -George
George Dunlap
2013-Aug-12 10:23 UTC
Re: [V10 PATCH 21/23] PVH xen: VMX support of PVH guest creation/destruction
On Sat, Aug 10, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 9 Aug 2013 14:44:48 +0100
> George Dunlap <dunlapg@umich.edu> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>> > This patch implements the vmx portion of the guest create, ie
>> > vcpu and domain initialization. Some changes to support the destroy
>> > path.
>> >
>> > Change in V10:
>> >   - Don't call vmx_domain_initialise / vmx_domain_destroy for PVH.
>> >   - Do not set hvm_vcpu.guest_efer here in vmx.c.
>> >
>> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>> > ---
>> >  xen/arch/x86/hvm/vmx/vmx.c | 28 ++++++++++++++++++++++++++++
>> >  1 files changed, 28 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> > index 80109c1..f6ea39a 100644
>> > --- a/xen/arch/x86/hvm/vmx/vmx.c
>> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> > @@ -1076,6 +1076,28 @@ static void vmx_update_host_cr3(struct vcpu *v)
>> >      vmx_vmcs_exit(v);
>> >  }
>> >
>> > +/*
>> > + * PVH guest never causes CR3 write vmexit. This is called during the guest
>> > + * setup.
>> > + */
>> > +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr)
>> > +{
>> > +    vmx_vmcs_enter(v);
>> > +    switch ( cr )
>> > +    {
>> > +    case 3:
>> > +        __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]);
>> > +        hvm_asid_flush_vcpu(v);
>> > +        break;
>> > +
>> > +    default:
>> > +        printk(XENLOG_ERR
>> > +               "PVH: d%d v%d unexpected cr%d update at rip:%lx\n",
>> > +               v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP));
>> > +    }
>> > +    vmx_vmcs_exit(v);
>> > +}
>>
>> This function seems almost completely pointless. In the case of CR3,
>> it basically does exactly what the function below does. It avoids
>> maybe doing something pointless, like vmx_load_ptrs(), but that should
>> be harmless, right?
>
> Harmless, nonetheless pointless paying small penalty calling it. Such
> things are subjective and matter of personal opinions, and vary from
> maintainer to maintainer.... I don't care either way. If it helps this
> patch end its misery, I'll make the change!

No, it's not pointless. Having extra code means more instruction
cache misses, so it's as likely as not to slow things down rather than
speed things up. And in any case, making the code easier to read and
maintain is absolutely worth an extra dozen cycles every time a domain
is created.

 -George
Tim Deegan
2013-Aug-12 10:24 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 08:54 +0100 on 12 Aug (1376297674), Jan Beulich wrote:
> >>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > Since we are loading gdtr and selectors cs/ds/ss, we should also load
> > the hidden fields for cs/ds/ss. That IMO is plenty enough support for
> > the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs
> > base, etc first thing in its HVM container. What do you all think?
>
> If you implement loading the hidden fields of CS, then doing the
> same for the LDT shouldn't be that much more code (and if you
> permit a non-null LDT selector, then having it in place would even
> be a requirement before validating the CS selector). But again,
> I had already indicated that I'd be fine with requiring the state to
> be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
> and GS holding null selectors. And CS descriptor validation done
> only in debug mode.

If you're going that way, please go the whole hog:
 - _all_ of cs/ss/ds/es/fs/gs arguments required to be null
   (and so documented, and enforced).
 - GDT base/limit loaded from the args.
 - LDT base/limit args required (documented, enforced) to be zero.
 - Guest launches with flat 32/64-bit segments set up in the
   hidden part of all segments (or I guess on 32-bit you could have all
   but CS invalid). Then it can load its real segment state after boot.

That way we don't have the weird constraints on the layout/contents
of the guest's GDT or on its segment descriptors.

Would that be OK?

Tim.
Jan Beulich
2013-Aug-12 11:04 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 12.08.13 at 12:24, Tim Deegan <tim@xen.org> wrote:
> At 08:54 +0100 on 12 Aug (1376297674), Jan Beulich wrote:
>> >>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > Since we are loading gdtr and selectors cs/ds/ss, we should also load
>> > the hidden fields for cs/ds/ss. That IMO is plenty enough support for
>> > the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs
>> > base, etc first thing in its HVM container. What do you all think?
>>
>> If you implement loading the hidden fields of CS, then doing the
>> same for the LDT shouldn't be that much more code (and if you
>> permit a non-null LDT selector, then having it in place would even
>> be a requirement before validating the CS selector). But again,
>> I had already indicated that I'd be fine with requiring the state to
>> be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
>> and GS holding null selectors. And CS descriptor validation done
>> only in debug mode.
>
> If you're going that way, please go the whole hog:
>  - _all_ of cs/ss/ds/es/fs/gs arguments required to be null
>    (and so documented, and enforced).
>  - GDT base/limit loaded from the args.
>  - LDT base/limit args required (documented, enforced) to be zero.
>  - Guest launches with flat 32/64-bit segments set up in the
>    hidden part of all segments (or I guess on 32-bit you could have all
>    but CS invalid). Then it can load its real segment state after boot.
>
> That way we don't have the weird constraints on the layout/contents
> of the guest's GDT or on its segment descriptors.
>
> Would that be OK?

I don't think CS = null is valid. And similarly, for future 32-bit
PVH SS = null wouldn't be valid, and DS = null as well as ES = null
would likely be a bad idea there.

Jan
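
The "truly minimal state" option debated in this sub-thread amounts to a handful of checks on the context handed to the hypervisor. The sketch below follows the 64-bit variant as described by Jan (non-null CS, null data selectors, empty LDT). `struct demo_vcpu_ctxt` and both function names are hypothetical and greatly simplified relative to `struct vcpu_guest_context`, and the thread had not settled on these rules:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the fields under discussion. */
struct demo_vcpu_ctxt {
    uint16_t cs, ss, ds, es, fs, gs;
    uint64_t ldt_base;
    uint32_t ldt_ents;
};

/* Selector values 0..3 all refer to the null descriptor (index 0, TI = 0). */
static int demo_sel_is_null(uint16_t sel)
{
    return (sel & ~3u) == 0;
}

/*
 * Enforce one reading of the minimal 64-bit start-up state: the LDT must
 * be empty, CS must be non-null, and every other selector must be null.
 * Returns 0 if the context is acceptable, -EINVAL otherwise.
 */
static int demo_check_minimal_state(const struct demo_vcpu_ctxt *c)
{
    assert(c != NULL);

    if ( c->ldt_base != 0 || c->ldt_ents != 0 )
        return -EINVAL;

    if ( demo_sel_is_null(c->cs) )
        return -EINVAL;

    if ( !demo_sel_is_null(c->ss) || !demo_sel_is_null(c->ds) ||
         !demo_sel_is_null(c->es) || !demo_sel_is_null(c->fs) ||
         !demo_sel_is_null(c->gs) )
        return -EINVAL;

    return 0;
}
```

Tim's stricter proposal would change only the CS test (requiring it to be null as well), while a future 32-bit variant would, per Jan's objection, need non-null SS and probably DS and ES.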
I did a quick edit to see what a unified function would look like (hasn''t even been compile tested). I think it looks just fine. -George PVH xen: vmcs related changes This patch contains vmcs changes related for PVH, mainly creating a VMCS for PVH guest. This version modified to unify the PVH and HVM vmcs construction functions. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index 36f167f..0dc1ce5 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -634,7 +634,7 @@ void vmx_vmcs_exit(struct vcpu *v) { /* Don''t confuse vmx_do_resume (for @v or @current!) */ vmx_clear_vmcs(v); - if ( is_hvm_vcpu(current) ) + if ( !is_pv_vcpu(current) ) vmx_load_vmcs(current); spin_unlock(&v->arch.hvm_vmx.vmcs_lock); @@ -856,6 +856,54 @@ static void vmx_set_common_host_vmcs_fields(struct vcpu *v) __vmwrite(HOST_SYSENTER_EIP, sysenter_eip); } +static int pvh_check_requirements(struct vcpu *v) +{ + u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); + + if ( !paging_mode_hap(v->domain) ) + { + printk(XENLOG_G_INFO "HAP is required for PVH guest.\n"); + return -EINVAL; + } + if ( !cpu_has_vmx_pat ) + { + printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_msr_bitmap ) + { + printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_vpid ) + { + printk(XENLOG_G_INFO "PVH: CPU doesn''t have VPID support\n"); + return -ENOSYS; + } + if ( !cpu_has_vmx_secondary_exec_control ) + { + printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n"); + return -ENOSYS; + } + + if ( v->domain->arch.vtsc ) + { + printk(XENLOG_G_INFO + "At present PVH only supports the default timer mode\n"); + return -ENOSYS; + } + + required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR; + if ( (tmpval & required) != required ) + { + 
printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n", + required); + return -ENOSYS; + } + + return 0; +} + static int construct_vmcs(struct vcpu *v) { struct domain *d = v->domain; @@ -864,6 +912,13 @@ static int construct_vmcs(struct vcpu *v) vmx_vmcs_enter(v); + if ( is_pvh_vcpu(v) ) + { + int rc = pvh_check_requirements(v); + if ( rc ) + return rc; + } + /* VMCS controls. */ __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control); @@ -871,6 +926,20 @@ static int construct_vmcs(struct vcpu *v) if ( d->arch.vtsc ) v->arch.hvm_vmx.exec_control |= CPU_BASED_RDTSC_EXITING; + if ( is_pvh_vcpu(v) ) + { + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW; + + /* ? */ + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING; + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING; + + ASSERT(v->arch.hvm_vmx.exec_conrol & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) + ASSERT(v->arch.hvm_vmx.exec_conrol & CPU_BASED_RDTSC_EXITING) + ASSERT(v->arch.hvm_vmx.exec_conrol & CPU_BASED_ACTIVATE_MSR_BITMAP) + } + + v->arch.hvm_vmx.secondary_exec_control = vmx_secondary_exec_control; /* Disable VPID for now: we decide when to enable it on VMENTER. */ @@ -900,7 +969,30 @@ static int construct_vmcs(struct vcpu *v) /* Do not enable Monitor Trap Flag unless start single step debug */ v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; + if ( is_pvh_vcpu(v) ) + { + /* FIXME: Just disable the things we need to disable */ + v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT; + v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID; + v->arch.hvm_vmx.secondary_exec_control |SECONDARY_EXEC_PAUSE_LOOP_EXITING; + } + vmx_update_cpu_exec_control(v); + + if ( is_pvh_vcpu(v) ) + { + /* + * Note: we run with default VM_ENTRY_LOAD_DEBUG_CTLS of 1, which means + * upon vmentry, the cpu reads/loads VMCS.DR7 and VMCS.DEBUGCTLS, and not + * use the host values. 0 would cause it to not use the VMCS values. 
+ */ + vmentry_ctl &= ~VM_ENTRY_LOAD_GUEST_EFER; + vmentry_ctl &= ~VM_ENTRY_SMM; + vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR; + /* PVH 32bitfixme. */ + vmentry_ctl |= VM_ENTRY_IA32E_MODE; /* GUEST_EFER.LME/LMA ignored */ + } + __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl); __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl); @@ -933,6 +1025,8 @@ static int construct_vmcs(struct vcpu *v) vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, MSR_TYPE_R | MSR_TYPE_W); if ( cpu_has_vmx_pat && paging_mode_hap(d) ) vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, MSR_TYPE_R | MSR_TYPE_W); + if ( is_pvh_domain(v) ) + vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, msr_type); } /* I/O access bitmap. */ @@ -980,6 +1074,17 @@ static int construct_vmcs(struct vcpu *v) __vmwrite(GUEST_ACTIVITY_STATE, 0); + if ( is_pvh_domain(v) ) + { + /* These are sorta irrelevant as we load the discriptors directly. */ + __vmwrite(GUEST_CS_SELECTOR, 0); + __vmwrite(GUEST_DS_SELECTOR, 0); + __vmwrite(GUEST_SS_SELECTOR, 0); + __vmwrite(GUEST_ES_SELECTOR, 0); + __vmwrite(GUEST_FS_SELECTOR, 0); + __vmwrite(GUEST_GS_SELECTOR, 0); + } + /* Guest segment bases. */ __vmwrite(GUEST_ES_BASE, 0); __vmwrite(GUEST_SS_BASE, 0); @@ -1002,7 +1107,11 @@ static int construct_vmcs(struct vcpu *v) __vmwrite(GUEST_DS_AR_BYTES, 0xc093); __vmwrite(GUEST_FS_AR_BYTES, 0xc093); __vmwrite(GUEST_GS_AR_BYTES, 0xc093); - __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */ + if ( is_pvh_domain(v) ) + /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme. */ + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); + else + __vmwrite(GUEST_CS_AR_BYTES, 0xc09b); /* exec/read, accessed */ /* Guest IDT. */ __vmwrite(GUEST_IDTR_BASE, 0); @@ -1028,16 +1137,24 @@ static int construct_vmcs(struct vcpu *v) __vmwrite(VMCS_LINK_POINTER, ~0UL); v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK - | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault)) - | (1U << TRAP_no_device); + | (paging_mode_hap(d) ? 
0 : (1U << TRAP_page_fault)) + | (is_pvh_domain(d) ? (1U << TRAP_debug) | (1U << TRAP_int3) : 0 + | (1U << TRAP_no_device); vmx_update_exception_bitmap(v); - v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET; + v->arch.hvm_vcpu.guest_cr[0] = is_pvh_domain(v) ? + ( X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP ) + : ( X86_CR0_PE | X86_CR0_ET ); hvm_update_guest_cr(v, 0); - v->arch.hvm_vcpu.guest_cr[4] = 0; + v->arch.hvm_vcpu.guest_cr[4] = is_pvh_domain(v) ? + real_cr4_to_pv_guest_cr4(mmu_cr4_features) + : 0; hvm_update_guest_cr(v, 4); + if ( is_pvh_domain(v) ) + v->arch.hvm_vmx.vmx_realmode = 0; + if ( cpu_has_vmx_tpr_shadow ) { __vmwrite(VIRTUAL_APIC_PAGE_ADDR, @@ -1294,6 +1411,9 @@ void vmx_do_resume(struct vcpu *v) hvm_asid_flush_vcpu(v); } + if ( is_pvh_vcpu(v) ) + reset_stack_and_jump(vmx_asm_do_vmentry); + debug_state = v->domain->debugger_attached || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_INT3] || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP]; @@ -1477,7 +1597,7 @@ static void vmcs_dump(unsigned char ch) for_each_domain ( d ) { - if ( !is_hvm_domain(d) ) + if ( is_pv_domain(d) ) continue; printk("\n>>> Domain %d <<<\n", d->domain_id); for_each_vcpu ( d, v ) On Mon, Aug 12, 2013 at 11:17 AM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:> On Mon, Aug 12, 2013 at 11:15 AM, George Dunlap > <George.Dunlap@eu.citrix.com> wrote: >> On Sat, Aug 10, 2013 at 1:23 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote: >>> On Fri, 9 Aug 2013 11:25:36 +0100 >>> George Dunlap <dunlapg@umich.edu> wrote: >>> >>>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor >>>> <mukesh.rathor@oracle.com> wrote: >>>> > This patch contains vmcs changes related for PVH, mainly creating a >>>> > VMCS for PVH guest. >>> ..... 
>>>> > + v->arch.hvm_vmx.vmx_realmode = 0; >>>> > + >>>> > + ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m)); >>>> > + __vmwrite(EPT_POINTER, ept_get_eptp(ept)); >>>> > + >>>> > + rdmsrl(MSR_IA32_CR_PAT, host_pat); >>>> > + __vmwrite(HOST_PAT, host_pat); >>>> > + __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET); >>>> > + >>>> > + /* The paging mode is updated for PVH by >>>> > arch_set_info_guest(). */ + >>>> > + return 0; >>>> > +} >>>> >>>> The majority of this function seems to be duplicating code in >>>> construct_vmcs(), but in a different order so that it''s very difficult >>>> to tell which is which. Wouldn''t it be better to just sprinkle >>>> if(is_pvh_domain()) around consrtuct_vmcs? >>> >>> >>> Nah, just makes the function extremely messy! Other maintainers I >>> consulted with were OK with making it a separate function. The function >>> is mostly orderded by vmx sections in the intel SDM. >> >> Does it? Messier than the domain building functions where we also do >> a lot of if(is_pvh_domain())''s? >> >> From my analysis, most of the differences are because the HVM code >> allows two ways of doing things (e.g., either HAP or shadow) while you >> only allow one. These include: >> - disabling TSC exiting (if in the correct mode, the HVM code would >> also do this) >> - Disabling invlpg and cr3 load/store exiting (same for HVM in HAP mode) >> - Unconditionally enabling secondary controls (HVM will enable if present) >> - Enabling the MSR bitmap (HVM will enable if present) >> - Updating cpu_based_exec_control directly (HVM has a function that >> switches between this and something required for nested virt) >> - Unconditionally enabling VPID (HVM will enable it somewhere else if >> appropriate) >> - &c &c > > I should have also said, since you would be calling > pvh_check_requirements() at the beginning of the shared function > anyway, all of these are guaranteed to set things up as required by > PVH. 
> > -George
On Mon, Aug 12, 2013 at 12:22 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> + ASSERT(v->arch.hvm_vmx.exec_conrol &
> CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
> + ASSERT(v->arch.hvm_vmx.exec_conrol & CPU_BASED_RDTSC_EXITING)
> + ASSERT(v->arch.hvm_vmx.exec_conrol & CPU_BASED_ACTIVATE_MSR_BITMAP)
> + }
> +

Obviously not compile tested, if I forgot a bunch of semicolons. :-) The point was just to say, it's really not that messy.

-George
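The argument above can be sketched in a few lines: the shared construction code keeps the HVM "enable if present" logic, a PVH requirements check runs first, and the assertions then hold by construction. This is only an illustration. The helper names are hypothetical, the bit values follow the Intel SDM layout of the primary processor-based controls, and none of it is taken from the Xen tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary processor-based VM-execution control bits (Intel SDM layout). */
#define CPU_BASED_ACTIVATE_MSR_BITMAP          0x10000000u
#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS  0x80000000u

/* Hypothetical: the controls a PVH vcpu cannot do without. */
#define PVH_REQUIRED_CONTROLS \
    (CPU_BASED_ACTIVATE_MSR_BITMAP | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)

/* Up-front check: does the CPU offer every control PVH depends on? */
static bool pvh_requirements_met(uint32_t supported)
{
    return (supported & PVH_REQUIRED_CONTROLS) == PVH_REQUIRED_CONTROLS;
}

/* Shared path: switch on each optional control only if it is supported. */
static uint32_t build_exec_control(uint32_t base, uint32_t supported)
{
    uint32_t ctl = base;

    if ( supported & CPU_BASED_ACTIVATE_MSR_BITMAP )
        ctl |= CPU_BASED_ACTIVATE_MSR_BITMAP;
    if ( supported & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS )
        ctl |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;

    return ctl;
}
```

If pvh_requirements_met() passed, the result of build_exec_control() necessarily contains every bit in PVH_REQUIRED_CONTROLS, which is what the ASSERTs in the quoted snippet would verify.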
Tim Deegan
2013-Aug-12 11:53 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 12:04 +0100 on 12 Aug (1376309065), Jan Beulich wrote:
> >>> On 12.08.13 at 12:24, Tim Deegan <tim@xen.org> wrote:
> > At 08:54 +0100 on 12 Aug (1376297674), Jan Beulich wrote:
> >> >>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >> > Since we are loading gdtr and selectors cs/ds/ss, we should also load
> >> > the hidden fields for cs/ds/ss. That IMO is plenty enough support for
> >> > the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs
> >> > base, etc first thing in it's HVM container. What do you all think?
> >>
> >> If you implement loading the hidden fields of CS, then doing the
> >> same for the LDT shouldn't be that much more code (and if you
> >> permit a non-null LDT selector, then having it in place would even
> >> be a requirement before validating the CS selector). But again,
> >> I had already indicated that I'd be fine with requiring the state to
> >> be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
> >> and GS holding null selectors. And CS descriptor validation done
> >> only in debug mode.
> >
> > If you're going that way, please go the whole hog:
> > - _all_ of cs/ss/ds/es/fs/gs arguments required to be null
> > (and so documented, and enforced).
> > - GDT base/limit loaded from the args.
> > - LDT base/limit args required (documented, enforced) to be zero.
> > - Guest launches with a flat 32/64bit segments set up in the
> > hidden part of all segments (or I guess on 32-bit you could have all
> > but CS invalid). Then it can load its real segment state after boot.
> >
> > That way we don't have the weird constraints on the layout/contents
> > of the guest's GDT or on its segment descriptors.
> >
> > Would that be OK?
>
> I don't think CS = null is valid.

Ah, you're right, that's explicitly disallowed. :( Xen could set any non-null descriptor (still requiring the caller to specify null descriptors and documenting that the descriptor registers will be undefined until the guest loads them on the new vcpu).

If we really require the selectors to match the GDT contents then we either have to constrain the selector/GDT contents (a horrible interface) or properly emulate the loads (which surely we already have code in Xen to do).

Tim.
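The "whole hog" interface proposed in this message amounts to a simple argument check at vcpu bring-up. A minimal sketch, assuming a made-up structure for the caller-supplied state (the struct and function names are hypothetical and do not correspond to Xen's vcpu_guest_context):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the state a caller passes to bring up a vcpu. */
struct pvh_boot_segs {
    uint16_t cs, ss, ds, es, fs, gs;   /* selector arguments */
    uint64_t gdt_base;                 /* loaded as given */
    uint16_t gdt_limit;                /* loaded as given */
    uint64_t ldt_base;                 /* must be zero */
    uint16_t ldt_limit;                /* must be zero */
};

/*
 * Enforce the proposed contract: every selector argument is null and the
 * LDT is empty. GDT base/limit are taken from the arguments unchecked.
 */
static bool pvh_boot_segs_valid(const struct pvh_boot_segs *s)
{
    if ( s->cs | s->ss | s->ds | s->es | s->fs | s->gs )
        return false;
    if ( s->ldt_base != 0 || s->ldt_limit != 0 )
        return false;
    return true;
}
```

This validates only what the caller passes in. As noted in the reply, the hypervisor would still have to put some non-null value in CS itself, since a null CS is not a legal state.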
Jan Beulich
2013-Aug-12 13:24 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 12.08.13 at 13:53, Tim Deegan <tim@xen.org> wrote:
> At 12:04 +0100 on 12 Aug (1376309065), Jan Beulich wrote:
>> >>> On 12.08.13 at 12:24, Tim Deegan <tim@xen.org> wrote:
>> > At 08:54 +0100 on 12 Aug (1376297674), Jan Beulich wrote:
>> >> >>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> >> > Since we are loading gdtr and selectors cs/ds/ss, we should also load
>> >> > the hidden fields for cs/ds/ss. That IMO is plenty enough support for
>> >> > the vcpu to come up, and the vcpu itself can then load ldtr, fs base, gs
>> >> > base, etc first thing in it's HVM container. What do you all think?
>> >>
>> >> If you implement loading the hidden fields of CS, then doing the
>> >> same for the LDT shouldn't be that much more code (and if you
>> >> permit a non-null LDT selector, then having it in place would even
>> >> be a requirement before validating the CS selector). But again,
>> >> I had already indicated that I'd be fine with requiring the state to
>> >> be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
>> >> and GS holding null selectors. And CS descriptor validation done
>> >> only in debug mode.
>> >
>> > If you're going that way, please go the whole hog:
>> > - _all_ of cs/ss/ds/es/fs/gs arguments required to be null
>> > (and so documented, and enforced).
>> > - GDT base/limit loaded from the args.
>> > - LDT base/limit args required (documented, enforced) to be zero.
>> > - Guest launches with a flat 32/64bit segments set up in the
>> > hidden part of all segments (or I guess on 32-bit you could have all
>> > but CS invalid). Then it can load its real segment state after boot.
>> >
>> > That way we don't have the weird constraints on the layout/contents
>> > of the guest's GDT or on its segment descriptors.
>> >
>> > Would that be OK?
>>
>> I don't think CS = null is valid.
>
> Ah, you're right, that's explicilty disallowed. :( Xen could set any
> non-null descriptor (still requiring the caller to specify null
> descriptors and documenting that the descriptor registers will be
> undefined until the guest loads them on the new vcpu).

Hmm, yes, I don't really like starting a guest with inconsistent state, but it's an option.

> If we really require the selectors to match the GDT contents then we
> either have to constrain the selector/GDT contents (a horrible
> interface) or properly emulate the loads (which surely we already have
> code in Xen to do).

protmode_load_seg() in the x86 emulation code appears to be the only one. But it doesn't seem impossible to leverage this here.

Jan
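For readers unfamiliar with what "emulating the load" involves, the core step is turning an 8-byte GDT entry into the hidden base, limit, and attribute fields. The sketch below shows just that decoding, using the standard x86 descriptor layout and the VMX access-rights packing. The struct and function are hypothetical and are not Xen's protmode_load_seg(), which also has to check table limits, presence, and privilege.

```c
#include <stdint.h>

/* Hypothetical container for the hidden part of a segment register. */
struct seg_hidden {
    uint32_t base;
    uint32_t limit;   /* byte-granular, after applying the G bit */
    uint16_t ar;      /* access byte in bits 0-7, AVL/L/DB/G in bits 12-15 */
};

/* Decode a legacy 8-byte code/data descriptor. */
static struct seg_hidden decode_descriptor(uint64_t d)
{
    struct seg_hidden h;

    /* limit[15:0] is bits 0-15, limit[19:16] is bits 48-51. */
    h.limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 32) & 0xf0000);
    /* base[23:0] is bits 16-39, base[31:24] is bits 56-63. */
    h.base = (uint32_t)((d >> 16) & 0xffffff) |
             ((uint32_t)((d >> 56) & 0xff) << 24);
    /* Access byte is bits 40-47, flag nibble is bits 52-55. */
    h.ar = (uint16_t)(((d >> 40) & 0xff) | (((d >> 52) & 0xf) << 12));

    /* G (bit 55) set: the limit counts 4 KiB pages. */
    if ( (d >> 55) & 1 )
        h.limit = (h.limit << 12) | 0xfff;

    return h;
}
```

A flat 64-bit code descriptor such as 0x00af9b000000ffff decodes to attributes 0xa09b, and the 32-bit equivalent 0x00cf9b000000ffff to 0xc09b, matching the two GUEST_CS_AR_BYTES values written in the VMCS patch earlier in the thread.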
>>> On 12.08.13 at 13:22, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> I did a quick edit to see what a unified function would look like
> (hasn't even been compile tested). I think it looks just fine.

Indeed.

Jan
George Dunlap
2013-Aug-12 16:00 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:> This patch contains vmx exit handler for PVH guest. Note it contains > a macro dbgp1 to print vmexit reasons and a lot of other data > to go with it. It can be enabled by setting pvhdbg to 1. This can be > very useful debugging for the first few months of testing, after which > it can be removed at the maintainer''s discretion. > > Changes in V2: > - Move non VMX generic code to arch/x86/hvm/pvh.c > - Remove get_gpr_ptr() and use existing decode_register() instead. > - Defer call to pvh vmx exit handler until interrupts are enabled. So the > caller vmx_pvh_vmexit_handler() handles the NMI/EXT-INT/TRIPLE_FAULT now. > - Fix the CPUID (wrongly) clearing bit 24. No need to do this now, set > the correct feature bits in CR4 during vmcs creation. > - Fix few hard tabs. > > Changes in V3: > - Lot of cleanup and rework in PVH vm exit handler. > - add parameter to emulate_forced_invalid_op(). > > Changes in V5: > - Move pvh.c and emulate_forced_invalid_op related changes to another patch. > - Formatting. > - Remove vmx_pvh_read_descriptor(). > - Use SS DPL instead of CS.RPL for CPL. > - Remove pvh_user_cpuid() and call pv_cpuid for user mode also. > > Changes in V6: > - Replace domain_crash_synchronous() with domain_crash(). > > Changes in V7: > - Don''t read all selectors on every vmexit. Do that only for the > IO instruction vmexit. > - Add couple checks and set guest_cr[4] in access_cr4(). > - Add period after all comments in case that''s an issue. > - Move making pv_cpuid and emulate_privileged_op public here. > > Changes in V8: > - Mainly, don''t read selectors on vmexit. The macros now come to VMCS > to read selectors on demand.Overall I have the same comment here as I had for the VMCS patch: the code looks 98% identical. 
Substantial differences seem to be: - emulation of privileged ops - cpuid - cr4 handling It seems like it would be much better to share the codepath and just put "is_pvh_domain()" in the places where it needs to be different. More comments / questions inline.> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> > --- > xen/arch/x86/hvm/vmx/pvh.c | 451 +++++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 450 insertions(+), 1 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmx/pvh.c b/xen/arch/x86/hvm/vmx/pvh.c > index 8e61d23..ba11967 100644 > --- a/xen/arch/x86/hvm/vmx/pvh.c > +++ b/xen/arch/x86/hvm/vmx/pvh.c > @@ -20,9 +20,458 @@ > #include <asm/hvm/nestedhvm.h> > #include <asm/xstate.h> > > -/* Implemented in the next patch */ > +#ifndef NDEBUG > +static int pvhdbg = 0; > +#define dbgp1(...) do { (pvhdbg == 1) ? printk(__VA_ARGS__) : 0; } while ( 0 ) > +#else > +#define dbgp1(...) ((void)0) > +#endif > + > +/* Returns : 0 == msr read successfully. */ > +static int vmxit_msr_read(struct cpu_user_regs *regs) > +{ > + u64 msr_content = 0; > + > + switch ( regs->ecx ) > + { > + case MSR_IA32_MISC_ENABLE: > + rdmsrl(MSR_IA32_MISC_ENABLE, msr_content); > + msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL | > + MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; > + break; > + > + default: > + /* PVH fixme: see hvm_msr_read_intercept(). */ > + rdmsrl(regs->ecx, msr_content); > + break;So at the moment you basically pass through all MSR reads (adding BTS_UNAVAIL and PEBS_UNAVAIL to MISC_ENABLE), but send MSR writes through the hvm code? That sounds like it''s asking for trouble...> + } > + regs->eax = (uint32_t)msr_content; > + regs->edx = (uint32_t)(msr_content >> 32); > + vmx_update_guest_eip(); > + > + dbgp1("msr read c:%lx a:%lx d:%lx RIP:%lx RSP:%lx\n", regs->ecx, regs->eax, > + regs->edx, regs->rip, regs->rsp); > + > + return 0; > +} > + > +/* Returns : 0 == msr written successfully. 
*/ > +static int vmxit_msr_write(struct cpu_user_regs *regs) > +{ > + uint64_t msr_content = (uint32_t)regs->eax | ((uint64_t)regs->edx << 32); > + > + dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx, > + regs->eax, regs->edx); > + > + if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY ) > + { > + vmx_update_guest_eip(); > + return 0; > + } > + return 1; > +} > + > +static int vmxit_debug(struct cpu_user_regs *regs) > +{ > + struct vcpu *vp = current; > + unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION); > + > + write_debugreg(6, exit_qualification | 0xffff0ff0); > + > + /* gdbsx or another debugger. Never pause dom0. */ > + if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached ) > + domain_pause_for_debugger(); > + else > + hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);Hmm, strangely enough, the HVM handler for this doesn''t seem to deliver this exception -- or if it does, I can''t quite figure out where. What you have here seems like the correct thing to do, but I would be interested in knowing the reason for the HVM behavior.> + > + return 0; > +} > + > +/* Returns: rc == 0: handled the MTF vmexit. 
*/ > +static int vmxit_mtf(struct cpu_user_regs *regs) > +{ > + struct vcpu *vp = current; > + int rc = -EINVAL, ss = vp->arch.hvm_vcpu.single_step; > + > + vp->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; > + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, vp->arch.hvm_vmx.exec_control); > + vp->arch.hvm_vcpu.single_step = 0; > + > + if ( vp->domain->debugger_attached && ss ) > + { > + domain_pause_for_debugger(); > + rc = 0; > + } > + return rc; > +} > + > +static int vmxit_int3(struct cpu_user_regs *regs) > +{ > + int ilen = vmx_get_instruction_length(); > + struct vcpu *vp = current; > + struct hvm_trap trap_info = { > + .vector = TRAP_int3, > + .type = X86_EVENTTYPE_SW_EXCEPTION, > + .error_code = HVM_DELIVER_NO_ERROR_CODE, > + .insn_len = ilen > + }; > + > + /* gdbsx or another debugger. Never pause dom0. */ > + if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached ) > + { > + regs->eip += ilen; > + dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id()); > + current->arch.gdbsx_vcpu_event = TRAP_int3; > + domain_pause_for_debugger(); > + return 0; > + } > + hvm_inject_trap(&trap_info); > + > + return 0; > +} > + > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */ > +static int vmxit_invalid_op(struct cpu_user_regs *regs) > +{ > + if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) ) > + hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); > + > + return 0; > +} > + > +/* Returns: rc == 0: handled the exception. 
*/ > +static int vmxit_exception(struct cpu_user_regs *regs) > +{ > + int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK; > + int rc = -ENOSYS;The vmx code here has some handler for faults that happen during a guest IRET -- is that an issue for PVH?> + > + dbgp1(" EXCPT: vec:%d cs:%lx r.IP:%lx\n", vector, > + __vmread(GUEST_CS_SELECTOR), regs->eip); > + > + switch ( vector ) > + { > + case TRAP_debug: > + rc = vmxit_debug(regs); > + break; > + > + case TRAP_int3: > + rc = vmxit_int3(regs); > + break; > + > + case TRAP_invalid_op: > + rc = vmxit_invalid_op(regs); > + break; > + > + case TRAP_no_device: > + hvm_funcs.fpu_dirty_intercept(); > + rc = 0; > + break; > + > + default: > + printk(XENLOG_G_WARNING > + "PVH: Unhandled trap:%d. IP:%lx\n", vector, regs->eip); > + } > + return rc; > +} > + > +static int vmxit_vmcall(struct cpu_user_regs *regs) > +{ > + if ( hvm_do_hypercall(regs) != HVM_HCALL_preempted ) > + vmx_update_guest_eip(); > + return 0; > +} > + > +/* Returns: rc == 0: success. */ > +static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp) > +{ > + struct vcpu *vp = current; > + > + if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR ) > + { > + unsigned long new_cr0 = *regp; > + unsigned long old_cr0 = __vmread(GUEST_CR0); > + > + dbgp1("PVH:writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp); > + if ( (u32)new_cr0 != new_cr0 ) > + { > + printk(XENLOG_G_WARNING > + "Guest setting upper 32 bits in CR0: %lx", new_cr0); > + return -EPERM; > + } > + > + new_cr0 &= ~HVM_CR0_GUEST_RESERVED_BITS; > + /* ET is reserved and should always be 1. */ > + new_cr0 |= X86_CR0_ET; > + > + /* A pvh is not expected to change to real mode. */ > + if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) !> + (X86_CR0_PG | X86_CR0_PE) ) > + { > + printk(XENLOG_G_WARNING > + "PVH attempting to turn off PE/PG. 
CR0:%lx\n", new_cr0); > + return -EPERM; > + } > + /* TS going from 1 to 0 */ > + if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) ) > + vmx_fpu_enter(vp); > + > + vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0; > + __vmwrite(GUEST_CR0, new_cr0); > + __vmwrite(CR0_READ_SHADOW, new_cr0); > + } > + else > + *regp = __vmread(GUEST_CR0);The HVM code here just uses hvm_vcpu.guest_cr[] -- is there any reason not to do the same here? And in any case, shouldn''t it be CR0_READ_SHADOW?> + > + return 0; > +} > + > +/* Returns: rc == 0: success. */ > +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp) > +{ > + if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR ) > + { > + struct vcpu *vp = current; > + u64 old_val = __vmread(GUEST_CR4); > + u64 new = *regp; > + > + if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) ) > + { > + printk(XENLOG_G_WARNING > + "PVH guest attempts to set reserved bit in CR4: %lx", new); > + hvm_inject_hw_exception(TRAP_gp_fault, 0); > + return 0; > + } > + > + if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) ) > + { > + printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while " > + "EFER.LMA is set"); > + hvm_inject_hw_exception(TRAP_gp_fault, 0); > + return 0; > + } > + > + vp->arch.hvm_vcpu.guest_cr[4] = new; > + > + if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) ) > + vpid_sync_all();Is it actually allowed for a PVH guest to change modes like this? I realize that at the moment you''re only supporting HAP, but that may not always be true; would it make sense to call paging_update_paging_modes() here instead?> + > + __vmwrite(CR4_READ_SHADOW, new); > + > + new &= ~X86_CR4_PAE; /* PVH always runs with hap enabled. 
*/ > + new |= X86_CR4_VMXE | X86_CR4_MCE; > + __vmwrite(GUEST_CR4, new);Should you be updating hvm_vcpu.hw_cr[4] to this value?> + } > + else > + *regp = __vmread(CR4_READ_SHADOW);Same as above re guest_cr[]> + > + return 0; > +} > + > +/* Returns: rc == 0: success, else -errno. */ > +static int vmxit_cr_access(struct cpu_user_regs *regs) > +{ > + unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION); > + uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification); > + int cr, rc = -EINVAL; > + > + switch ( acc_typ ) > + { > + case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR: > + case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR: > + { > + uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification); > + uint64_t *regp = decode_register(gpr, regs, 0); > + cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification); > + > + if ( regp == NULL ) > + break; > + > + switch ( cr ) > + { > + case 0: > + rc = access_cr0(regs, acc_typ, regp); > + break; > + > + case 3: > + printk(XENLOG_G_ERR "PVH: unexpected cr3 vmexit. rip:%lx\n", > + regs->rip); > + domain_crash(current->domain); > + break; > + > + case 4: > + rc = access_cr4(regs, acc_typ, regp); > + break; > + } > + if ( rc == 0 ) > + vmx_update_guest_eip(); > + break; > + } > + > + case VMX_CONTROL_REG_ACCESS_TYPE_CLTS: > + { > + struct vcpu *vp = current; > + unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS; > + vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0; > + > + vmx_fpu_enter(vp); > + __vmwrite(GUEST_CR0, cr0); > + __vmwrite(CR0_READ_SHADOW, cr0); > + vmx_update_guest_eip(); > + rc = 0; > + } > + } > + return rc; > +} > + > +/* > + * Note: A PVH guest sets IOPL natively by setting bits in the eflags, and not > + * via hypercalls used by a PV. > + */ > +static int vmxit_io_instr(struct cpu_user_regs *regs) > +{ > + struct segment_register seg; > + int requested = (regs->rflags & X86_EFLAGS_IOPL) >> 12; > + int curr_lvl = (regs->rflags & X86_EFLAGS_VM) ? 
3 : 0; > + > + if ( curr_lvl == 0 ) > + { > + hvm_get_segment_register(current, x86_seg_ss, &seg); > + curr_lvl = seg.attr.fields.dpl; > + } > + if ( requested >= curr_lvl && emulate_privileged_op(regs) ) > + return 0; > + > + hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code); > + return 0; > +} > + > +static int pvh_ept_handle_violation(unsigned long qualification, > + paddr_t gpa, struct cpu_user_regs *regs) > +{ > + unsigned long gla, gfn = gpa >> PAGE_SHIFT; > + p2m_type_t p2mt; > + mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt); > + > + printk(XENLOG_G_ERR "EPT violation %#lx (%c%c%c/%c%c%c), " > + "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n", > + qualification, > + (qualification & EPT_READ_VIOLATION) ? ''r'' : ''-'', > + (qualification & EPT_WRITE_VIOLATION) ? ''w'' : ''-'', > + (qualification & EPT_EXEC_VIOLATION) ? ''x'' : ''-'', > + (qualification & EPT_EFFECTIVE_READ) ? ''r'' : ''-'', > + (qualification & EPT_EFFECTIVE_WRITE) ? ''w'' : ''-'', > + (qualification & EPT_EFFECTIVE_EXEC) ? ''x'' : ''-'', > + gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp); > + > + ept_walk_table(current->domain, gfn); > + > + if ( qualification & EPT_GLA_VALID ) > + { > + gla = __vmread(GUEST_LINEAR_ADDRESS); > + printk(XENLOG_G_ERR " --- GLA %#lx\n", gla); > + } > + hvm_inject_hw_exception(TRAP_gp_fault, 0); > + return 0; > +} > + > +/* > + * Main vm exit handler for PVH . Called from vmx_vmexit_handler(). > + * Note: vmx_asm_vmexit_handler updates rip/rsp/eflags in regs{} struct. 
> + */ > void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs) > { > + unsigned long exit_qualification; > + unsigned int exit_reason = __vmread(VM_EXIT_REASON); > + int rc=0, ccpu = smp_processor_id(); > + struct vcpu *v = current; > + > + dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx EFLAGS:%lx CR0:%lx\n", > + ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags, > + __vmread(GUEST_CR0)); > + > + switch ( (uint16_t)exit_reason ) > + { > + /* NMI and machine_check are handled by the caller, we handle rest here */ > + case EXIT_REASON_EXCEPTION_NMI: /* 0 */ > + rc = vmxit_exception(regs); > + break; > + > + case EXIT_REASON_EXTERNAL_INTERRUPT: /* 1 */ > + break; /* handled in vmx_vmexit_handler() */This seems to be a weird way to do things, but I see this is what they do in vmx_vmexit_handler() as well, so I guess it makes sense to follow suit. What about EXIT_REASON_TRIPLE_FAULT? [End of in-line comments]> + > + case EXIT_REASON_PENDING_VIRT_INTR: /* 7 */ > + /* Disable the interrupt window. 
*/ > + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING; > + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control); > + break; > + > + case EXIT_REASON_CPUID: /* 10 */ > + pv_cpuid(regs); > + vmx_update_guest_eip(); > + break; > + > + case EXIT_REASON_HLT: /* 12 */ > + vmx_update_guest_eip(); > + hvm_hlt(regs->eflags); > + break; > + > + case EXIT_REASON_VMCALL: /* 18 */ > + rc = vmxit_vmcall(regs); > + break; > + > + case EXIT_REASON_CR_ACCESS: /* 28 */ > + rc = vmxit_cr_access(regs); > + break; > + > + case EXIT_REASON_DR_ACCESS: /* 29 */ > + exit_qualification = __vmread(EXIT_QUALIFICATION); > + vmx_dr_access(exit_qualification, regs); > + break; > + > + case EXIT_REASON_IO_INSTRUCTION: /* 30 */ > + vmxit_io_instr(regs); > + break; > + > + case EXIT_REASON_MSR_READ: /* 31 */ > + rc = vmxit_msr_read(regs); > + break; > + > + case EXIT_REASON_MSR_WRITE: /* 32 */ > + rc = vmxit_msr_write(regs); > + break; > + > + case EXIT_REASON_MONITOR_TRAP_FLAG: /* 37 */ > + rc = vmxit_mtf(regs); > + break; > + > + case EXIT_REASON_MCE_DURING_VMENTRY: /* 41 */ > + break; /* handled in vmx_vmexit_handler() */ > + > + case EXIT_REASON_EPT_VIOLATION: /* 48 */ > + { > + paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS); > + exit_qualification = __vmread(EXIT_QUALIFICATION); > + rc = pvh_ept_handle_violation(exit_qualification, gpa, regs); > + break; > + } > + > + default: > + rc = 1; > + printk(XENLOG_G_ERR > + "PVH: Unexpected exit reason:%#x\n", exit_reason); > + } > + > + if ( rc ) > + { > + exit_qualification = __vmread(EXIT_QUALIFICATION); > + printk(XENLOG_G_WARNING > + "PVH: [%d] exit_reas:%d %#x qual:%ld 0x%lx cr0:0x%016lx\n", > + ccpu, exit_reason, exit_reason, exit_qualification, > + exit_qualification, __vmread(GUEST_CR0)); > + printk(XENLOG_G_WARNING "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n", > + regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3)); > + domain_crash(v->domain); > + } > } > > /* > -- > 1.7.2.3 > > > 
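The privilege test in the quoted vmxit_io_instr() is compact enough to restate on its own. The sketch below mirrors its logic: the I/O privilege level comes from RFLAGS, the current level is 3 in virtual-8086 mode and otherwise SS.DPL, and the access is allowed when IOPL is at least the current level. The function name is hypothetical; the flag masks are the architectural EFLAGS bit positions.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IOPL 0x00003000u   /* bits 12-13 */
#define X86_EFLAGS_VM   0x00020000u   /* bit 17 */

/*
 * Hypothetical restatement of the check in the quoted patch:
 * port access is permitted when IOPL >= current privilege level.
 */
static bool io_access_permitted(uint64_t rflags, unsigned int ss_dpl)
{
    unsigned int iopl = (unsigned int)((rflags & X86_EFLAGS_IOPL) >> 12);
    unsigned int cpl = (rflags & X86_EFLAGS_VM) ? 3 : ss_dpl;

    return iopl >= cpl;
}
```

In the patch a passing check still goes on to emulate_privileged_op(), and a #GP is injected if either step fails.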
George Dunlap
2013-Aug-12 16:13 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Mon, Aug 12, 2013 at 5:00 PM, George Dunlap <dunlapg@umich.edu> wrote:
>> +static int vmxit_debug(struct cpu_user_regs *regs)
>> +{
>> +    struct vcpu *vp = current;
>> +    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
>> +
>> +    write_debugreg(6, exit_qualification | 0xffff0ff0);
>> +
>> +    /* gdbsx or another debugger. Never pause dom0. */
>> +    if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached )
>> +        domain_pause_for_debugger();
>> +    else
>> +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
>
> Hmm, strangely enough, the HVM handler for this doesn't seem to
> deliver this exception -- or if it does, I can't quite figure out
> where. What you have here seems like the correct thing to do, but I
> would be interested in knowing the reason for the HVM behavior.

On further thought, this is probably wrong: HVM is probably configured to exit on this trap only when a debugger *is* attached; so if it's not attached, something is very wrong, and the domain should probably crash.
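One reading of the suggestion above, as a sketch: keep the DR6 composition and the "never pause dom0" rule from the quoted hunk, but crash the domain instead of injecting the exception when no debugger is attached. The enum and function names are hypothetical, and treating dom0 with a debugger attached as a crash case is an assumption of this sketch, not something the thread settles.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved DR6 bits that read as 1; same constant as in the quoted hunk. */
#define DR6_RESERVED_ONES 0xffff0ff0u

/* The exit qualification carries the B0-B3, BD and BS bits of DR6. */
static uint64_t dr6_from_exit_qualification(uint64_t qualification)
{
    return qualification | DR6_RESERVED_ONES;
}

enum debug_action { DBG_PAUSE_FOR_DEBUGGER, DBG_CRASH_DOMAIN };

/* Hypothetical policy: pause for an attached debugger, otherwise crash. */
static enum debug_action debug_trap_action(unsigned int domain_id,
                                           bool debugger_attached)
{
    if ( domain_id != 0 && debugger_attached )
        return DBG_PAUSE_FOR_DEBUGGER;
    return DBG_CRASH_DOMAIN;
}
```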
George Dunlap
2013-Aug-12 16:21 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> +static int pvh_ept_handle_violation(unsigned long qualification,
> +                                    paddr_t gpa, struct cpu_user_regs *regs)
> +{
> +    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
> +    p2m_type_t p2mt;
> +    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
> +
> +    printk(XENLOG_G_ERR "EPT violation %#lx (%c%c%c/%c%c%c), "
> +           "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n",
> +           qualification,
> +           (qualification & EPT_READ_VIOLATION) ? 'r' : '-',
> +           (qualification & EPT_WRITE_VIOLATION) ? 'w' : '-',
> +           (qualification & EPT_EXEC_VIOLATION) ? 'x' : '-',
> +           (qualification & EPT_EFFECTIVE_READ) ? 'r' : '-',
> +           (qualification & EPT_EFFECTIVE_WRITE) ? 'w' : '-',
> +           (qualification & EPT_EFFECTIVE_EXEC) ? 'x' : '-',
> +           gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp);
> +
> +    ept_walk_table(current->domain, gfn);
> +
> +    if ( qualification & EPT_GLA_VALID )
> +    {
> +        gla = __vmread(GUEST_LINEAR_ADDRESS);
> +        printk(XENLOG_G_ERR " --- GLA %#lx\n", gla);
> +    }
> +    hvm_inject_hw_exception(TRAP_gp_fault, 0);
> +    return 0;
> +}

Similar to the TRAP_debug issue -- the HVM code here crashes the guest; as there is unlikely to be anything the guest can do to fix things up at this point, that is almost certainly the right thing to do.

-George
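The "(%c%c%c/%c%c%c)" part of the log line in the quoted function is a decoding of the low bits of the EPT-violation exit qualification. A small stand-alone sketch of that decoding follows. The macro names match the quoted code and the bit positions follow the Intel SDM, but the helper function is hypothetical.

```c
#include <string.h>

/* EPT-violation exit qualification bits (Intel SDM layout). */
#define EPT_READ_VIOLATION   (1UL << 0)
#define EPT_WRITE_VIOLATION  (1UL << 1)
#define EPT_EXEC_VIOLATION   (1UL << 2)
#define EPT_EFFECTIVE_READ   (1UL << 3)
#define EPT_EFFECTIVE_WRITE  (1UL << 4)
#define EPT_EFFECTIVE_EXEC   (1UL << 5)
#define EPT_GLA_VALID        (1UL << 7)

/*
 * Hypothetical helper: render "attempted/permitted" access as "rwx/rwx".
 * Returns a pointer to a static buffer, so it is not reentrant.
 */
static const char *ept_access_string(unsigned long q)
{
    static char buf[8];

    memcpy(buf, "---/---", sizeof(buf));
    if ( q & EPT_READ_VIOLATION )  buf[0] = 'r';
    if ( q & EPT_WRITE_VIOLATION ) buf[1] = 'w';
    if ( q & EPT_EXEC_VIOLATION )  buf[2] = 'x';
    if ( q & EPT_EFFECTIVE_READ )  buf[4] = 'r';
    if ( q & EPT_EFFECTIVE_WRITE ) buf[5] = 'w';
    if ( q & EPT_EFFECTIVE_EXEC )  buf[6] = 'x';

    return buf;
}
```

For example, a write to a read-only mapping has qualification bits 1 and 3 set and renders as "-w-/r--".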
George Dunlap
2013-Aug-12 17:30 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Mon, Aug 12, 2013 at 5:00 PM, George Dunlap <dunlapg@umich.edu> wrote:> Overall I have the same comment here as I had for the VMCS patch: the > code looks 98% identical. Substantial differences seem to be: > - emulation of privileged ops > - cpuid > - cr4 handling > > It seems like it would be much better to share the codepath and just > put "is_pvh_domain()" in the places where it needs to be different.So below is a patch which, I think, should be functionally mostly equivalent to the patch you have -- but a *lot* shorter, and also *very* clear about how PVH is different than normal HVM. I think this is definitely the better approach. -George PVH xen: introduce vmexit handler for PVH This version has unified PVH and HVM vmexit handlers. A couple of notes: - No check for cr3 accesses; that''s a HAP/shadow issue, not a PVH one - debug trap and ept violation cause guest crash now, as with HVM - Don''t know what to do if a hcall returns invalidate - Don''t know what to do on task switch - Removed single_step=0 on MTF; may not be correct. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index c742d7b..e9f9ef6 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -1776,7 +1776,17 @@ int hvm_set_cr0(unsigned long value) (value & (X86_CR0_PE | X86_CR0_PG)) == X86_CR0_PG ) goto gpf; - if ( (value & X86_CR0_PG) && !(old_value & X86_CR0_PG) ) + + + /* A pvh is not expected to change to real mode. */ + if ( is_pvh_vcpu(v) + && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) ) + { + printk(XENLOG_G_WARNING + "PVH attempting to turn off PE/PG. 
CR0:%lx\n", new_cr0); + goto gpf + } + else if ( (value & X86_CR0_PG) && !(old_value & X86_CR0_PG) ) { if ( v->arch.hvm_vcpu.guest_efer & EFER_LME ) { @@ -1953,6 +1963,11 @@ int hvm_set_cr4(unsigned long value) * Modifying CR4.{PSE,PAE,PGE,SMEP}, or clearing CR4.PCIDE * invalidate all TLB entries. */ + /* + * PVH: I assume this is suitable -- it subsumes the conditions + * from the custom PVH handler: + * (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) + */ if ( ((old_cr ^ value) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE | X86_CR4_SMEP)) || (!(value & X86_CR4_PCIDE) && (old_cr & X86_CR4_PCIDE)) ) diff --git a/xen/arch/x86/hvm/vmx/pvh.c b/xen/arch/x86/hvm/vmx/pvh.c index 8e61d23..a5a8ee1 100644 --- a/xen/arch/x86/hvm/vmx/pvh.c +++ b/xen/arch/x86/hvm/vmx/pvh.c @@ -20,10 +20,6 @@ #include <asm/hvm/nestedhvm.h> #include <asm/xstate.h> -/* Implemented in the next patch */ -void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs) -{ -} /* * Set vmcs fields in support of vcpu_op -> VCPUOP_initialise hcall. Called diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index bbfa130..37ec385 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1244,6 +1244,12 @@ static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr) */ v->arch.hvm_vcpu.hw_cr[4] &= ~X86_CR4_SMEP; } + if ( is_pvh_vcpu(v) ) + { + /* What is this for? */ + v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VMXE | X86_CR4_MCE; + } + __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]); __vmwrite(CR4_READ_SHADOW, v->arch.hvm_vcpu.guest_cr[4]); break; @@ -2242,7 +2248,8 @@ static void ept_handle_violation(unsigned long qualification, paddr_t gpa) __trace_var(TRC_HVM_NPF, 0, sizeof(_d), &_d); } - ret = hvm_hap_nested_page_fault(gpa, + ret = is_pvh_domain(d) ? 0 : + hvm_hap_nested_page_fault(gpa, qualification & EPT_GLA_VALID ? 1 : 0, qualification & EPT_GLA_VALID ? 
__vmread(GUEST_LINEAR_ADDRESS) : ~0ull, @@ -2490,12 +2497,6 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) ) return vmx_failed_vmentry(exit_reason, regs); - if ( is_pvh_vcpu(v) ) - { - vmx_pvh_vmexit_handler(regs); - return; - } - if ( v->arch.hvm_vmx.vmx_realmode ) { /* Put RFLAGS back the way the guest wants it */ @@ -2654,8 +2655,16 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) /* Already handled above. */ break; case TRAP_invalid_op: - HVMTRACE_1D(TRAP, vector); - vmx_vmexit_ud_intercept(regs); + if ( is_pvh_domain(d) ) + { + if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) ) + hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); + } + else + { + HVMTRACE_1D(TRAP, vector); + vmx_vmexit_ud_intercept(regs); + } break; default: HVMTRACE_1D(TRAP, vector); @@ -2685,6 +2694,8 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) }; int32_t ecode = -1, source; + /* PVH FIXME: What to do? */ + exit_qualification = __vmread(EXIT_QUALIFICATION); source = (exit_qualification >> 30) & 3; /* Vectored event should fill in interrupt information. */ @@ -2704,8 +2715,8 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) break; } case EXIT_REASON_CPUID: + is_pvh_domain(d) ? pv_cpuid(regs) : vmx_do_cpuid(regs); vmx_update_guest_eip(); /* Safe: CPUID */ - vmx_do_cpuid(regs); break; case EXIT_REASON_HLT: vmx_update_guest_eip(); /* Safe: HLT */ @@ -2731,6 +2742,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) if ( rc != HVM_HCALL_preempted ) { vmx_update_guest_eip(); /* Safe: VMCALL */ + /* PVH FIXME: What to do? 
*/ if ( rc == HVM_HCALL_invalidate ) send_invalidate_req(); } @@ -2750,7 +2762,28 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) case EXIT_REASON_MSR_READ: { uint64_t msr_content; - if ( hvm_msr_read_intercept(regs->ecx, &msr_content) == X86EMUL_OKAY ) + if ( is_pvh_vcpu(v) ) + { + u64 msr_content = 0; + + switch ( regs->ecx ) + { + case MSR_IA32_MISC_ENABLE: + rdmsrl(MSR_IA32_MISC_ENABLE, msr_content); + msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL | + MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; + break; + + default: + /* PVH fixme: see hvm_msr_read_intercept(). */ + rdmsrl(regs->ecx, msr_content); + break; + } + regs->eax = (uint32_t)msr_content; + regs->edx = (uint32_t)(msr_content >> 32); + vmx_update_guest_eip(); /* Safe: RDMSR */ + } + else if ( hvm_msr_read_intercept(regs->ecx, &msr_content) =X86EMUL_OKAY ) { regs->eax = (uint32_t)msr_content; regs->edx = (uint32_t)(msr_content >> 32); @@ -2853,21 +2886,42 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) } case EXIT_REASON_IO_INSTRUCTION: - exit_qualification = __vmread(EXIT_QUALIFICATION); - if ( exit_qualification & 0x10 ) + if ( is_pvh_vcpu(v) ) { - /* INS, OUTS */ - if ( !handle_mmio() ) - hvm_inject_hw_exception(TRAP_gp_fault, 0); + /* + * Note: A PVH guest sets IOPL natively by setting bits in + * the eflags, and not via hypercalls used by a PV. + */ + struct segment_register seg; + int requested = (regs->rflags & X86_EFLAGS_IOPL) >> 12; + int curr_lvl = (regs->rflags & X86_EFLAGS_VM) ? 3 : 0; + + if ( curr_lvl == 0 ) + { + hvm_get_segment_register(current, x86_seg_ss, &seg); + curr_lvl = seg.attr.fields.dpl; + } + if ( requested < curr_lvl || !emulate_privileged_op(regs) ) + hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code); } else { - /* IN, OUT */ - uint16_t port = (exit_qualification >> 16) & 0xFFFF; - int bytes = (exit_qualification & 0x07) + 1; - int dir = (exit_qualification & 0x08) ? 
IOREQ_READ : IOREQ_WRITE; - if ( handle_pio(port, bytes, dir) ) - vmx_update_guest_eip(); /* Safe: IN, OUT */ + exit_qualification = __vmread(EXIT_QUALIFICATION); + if ( exit_qualification & 0x10 ) + { + /* INS, OUTS */ + if ( !handle_mmio() ) + hvm_inject_hw_exception(TRAP_gp_fault, 0); + } + else + { + /* IN, OUT */ + uint16_t port = (exit_qualification >> 16) & 0xFFFF; + int bytes = (exit_qualification & 0x07) + 1; + int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE; + if ( handle_pio(port, bytes, dir) ) + vmx_update_guest_eip(); /* Safe: IN, OUT */ + } } break; @@ -2890,6 +2944,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) case EXIT_REASON_MONITOR_TRAP_FLAG: v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; vmx_update_cpu_exec_control(v); + /* PVH code set hvm_vcpu.single_step = 0 -- is that necessary? */ if ( v->arch.hvm_vcpu.single_step ) { hvm_memory_event_single_step(regs->eip); if ( v->domain->debugger_attached ) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
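For readers following the EXIT_REASON_IO_INSTRUCTION hunk above, the privilege comparison in the PVH branch can be modelled in isolation. This is only a sketch under stated assumptions, not code from the patch: `pvh_io_allowed` and the `EFLAGS_*` constants are hypothetical stand-ins for the hypervisor's own definitions, and `ss_dpl` stands in for the DPL that the patch reads from the SS segment register. The patch additionally requires emulate_privileged_op() to succeed, which is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local names; the bit positions are the architectural ones. */
#define EFLAGS_IOPL_MASK  0x3000u   /* RFLAGS bits 12-13: I/O privilege level */
#define EFLAGS_IOPL_SHIFT 12
#define EFLAGS_VM         0x20000u  /* RFLAGS bit 17: virtual-8086 mode */

/*
 * Hypothetical helper, not part of the patch: returns true when the current
 * privilege level may execute an I/O instruction, i.e. when CPL <= IOPL.
 * The patch expresses the inverse: it injects #GP when IOPL < CPL.
 */
static bool pvh_io_allowed(uint64_t rflags, unsigned int ss_dpl)
{
    unsigned int iopl = (rflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
    /* As in the patch: virtual-8086 mode counts as level 3, else use SS.DPL. */
    unsigned int cpl = (rflags & EFLAGS_VM) ? 3 : ss_dpl;

    return cpl <= iopl;
}
```

With IOPL left at 0, only ring 0 passes the check; a guest that raises IOPL to 3 in its flags lets ring 3 through as well, which matches the comment in the patch that a PVH guest sets IOPL natively rather than via hypercall.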
Mukesh Rathor
2013-Aug-13 00:02 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 12 Aug 2013 10:00:36 +0100 Tim Deegan <tim@xen.org> wrote:> At 16:41 -0700 on 09 Aug (1376066498), Mukesh Rathor wrote: > > On Thu, 08 Aug 2013 07:56:41 +0100 > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > > >>> On 08.08.13 at 03:05, Mukesh Rathor <mukesh.rathor@oracle.com> > > > >>> wrote: > > > > On Mon, 05 Aug 2013 12:10:15 +0100 > > > > "Jan Beulich" <JBeulich@suse.com> wrote: > > > > > > > >> >>> On 24.07.13 at 03:59, Mukesh Rathor > > > >> >>> <mukesh.rathor@oracle.com> wrote: > > > >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct > > > >> > vcpu_guest_context *ctxtp) +{ > > > >> > + if ( v->vcpu_id == 0 ) > > > >> > + return 0; > > > >> > + > > > >> > + if ( !(ctxtp->flags & VGCF_in_kernel) ) > > > >> > + return -EINVAL; > > > >> > + > > > >> > + vmx_vmcs_enter(v); > > > >> > + __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr); > > > >> > + __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit); > > > >> > + __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base); > > > >> > + __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents); > > > >> > > > >> Just noticed: Aren''t you mixing up entries and bytes here? > > > > > > > > Right: > > > > > > > > __vmwrite(GUEST_LDTR_LIMIT, (ctxtp->ldt_ents * 8 - 1) ); > > > > > > > > Any formatting issues here? I don''t see in coding style, and see > > > > both code where there is a space around ''*'' and not. > > > > > > The inner parentheses are superfluous. > > > > > > CODING_STYLE is pretty explicit about there needing to be white > > > space around operators: "Spaces are placed [...], and around > > > binary operators (except the structure access operators, ''.'' and > > > ''->'')." > > > > > > > Also, when setting the limit, do we need to worry about the G > > > > flag? or for that matter, D/B whether segment is growing up or > > > > down? It appears we don''t need to worry about that for LDT, but > > > > not sure reading the SDMs.. > > > > > > The D/B bit doesn''t matter for LDT (and TSS), but the G bit would. 
> >
> > Ugh, to find the G bit, I need to walk the GDT to find the LDT
> > descriptor.
>
> Why so? The caller supplies you with the LDT base and range, not a
> segment selector. I don't think you could find the right LDT selector
> by scanning the GDT anyway -- what if there were two that matched?

Because the range is interpreted in bytes or pages depending on the G
bit in the descriptor. But, it seems acceptable to require it be null,
so I'll just do that.

Mukesh
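As an aside on the entries-versus-bytes mix-up discussed in this message, the conversion proposed earlier in the thread (ldt_ents * 8 - 1) can be written out as a small helper. This is a minimal sketch, not code from the series: `ldt_ents_to_limit` and `DESC_SIZE` are hypothetical names, and returning 0 for an empty table is an assumption made for illustration only.

```c
#include <stdint.h>

#define DESC_SIZE 8u  /* bytes per legacy segment descriptor */

/*
 * Hypothetical helper: convert an LDT size given in entries (as carried in
 * the ldt_ents field) into a byte-granular limit. A limit is the offset of
 * the last valid byte, hence the "- 1". For ents == 0 there is no valid
 * byte at all; this sketch returns 0 and leaves it to the caller to treat
 * that case as "no LDT".
 */
static uint32_t ldt_ents_to_limit(uint32_t ents)
{
    return ents ? ents * DESC_SIZE - 1 : 0;
}
```

A table of 8192 entries yields a limit of 65535, which still fits a byte-granular 16-bit limit.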
Mukesh Rathor
2013-Aug-13 00:27 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 12 Aug 2013 08:54:34 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 10.08.13 at 01:41, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 08 Aug 2013 07:56:41 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >......
> > Ugh, to find the G bit, I need to walk the GDT to find the LDT
> > descriptor. Walking the GDT to look for system descriptor means
> > mapping guest gdt pages as I go thru the table, and also the system
> > descriptor sizes are different for 32bit vs IA-32e modes adding
> > extra code... All that just doesn't seem worth it to me for
> > supporting LDT during vcpu bringup.
>
> Which is why I suggested requiring the LDT to be empty.

I don't recall you doing that, wish I had noted that. OK, I'll just
require ldt base/limit to be null.

> > Keir, do you have any thoughts? Basically, I'm trying to support
> > VCPUOP_initialise here, which is used by a PV guest boot vcpu to
> > set context of another vcpu it's trying to bring up. In retrospect,
> > I should have just created VCPUOP_initialise_pvh with limited fields
> > needed for PVH. (We already ignore bunch of stuff for PVH from
> > VCPUOP_initialise like trap_ctxt, event_callback*,
> > syscall_callback*, etc...). But anyways, can't we just document
> > VCPUOP_initialise that only following fields are relevant and
> > honored for PVH:
> >
> > gdt.pvh.addr/limit, and ctxtp->user_regs.cs/ds/ss
> >
> > (And others used in arch_set_info_guest like user_regs, flags,...)
> >
> > Since we are loading gdtr and selectors cs/ds/ss, we should also
> > load the hidden fields for cs/ds/ss. That IMO is plenty enough
> > support for the vcpu to come up, and the vcpu itself can then load
> > ldtr, fs base, gs base, etc first thing in its HVM container. What
> > do you all think?
>
> If you implement loading the hidden fields of CS, then doing the
> same for the LDT shouldn't be that much more code (and if you
> permit a non-null LDT selector, then having it in place would even
> be a requirement before validating the CS selector). But again,
> I had already indicated that I'd be fine with requiring the state to
> be truly minimal: CS -> flat 64-bit code descriptor, SS, DS, ES, FS
> and GS holding null selectors. And CS descriptor validation done
> only in debug mode.

It seems so unlikely that any guest would *require* LDT table set on
boot, that I'll just check for NULL like you said above. We can always
enhance in future. CS->flat is an option, but it would require special
code in linux. It would be nice to keep that code same for both PV and
PVH in linux.

> Talking of the LDT selector: Iirc you modify struct
> vcpu_guest_context's GDT to match PVH needs, but if I'm not
> mistaken you don't do the same for the LDT - PVH would require
> merely a selector here, not a base/ents pair.

I think that's way overkill, may make sense for PV, but PVH can
just do it first thing in its boot code.

Let me crank out some code and propose it. I'd like PVH to make 4.4 if
possible.

thanks
mukesh
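To make the G-bit and "flat 64-bit code descriptor" points in this exchange concrete, here is a sketch of decoding a raw 8-byte segment descriptor. It is an illustration under stated assumptions, not code from the series: the helper names `desc_limit_bytes` and `desc_is_64bit_code` and the `DESC_*` macros are hypothetical, while the bit positions follow the architectural descriptor layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions within a raw 8-byte x86 segment descriptor. */
#define DESC_EXEC_BIT (1ULL << 43)  /* type bit 3: code segment */
#define DESC_S_BIT    (1ULL << 44)  /* set: code/data, clear: system */
#define DESC_P_BIT    (1ULL << 47)  /* present */
#define DESC_L_BIT    (1ULL << 53)  /* 64-bit code segment */
#define DESC_DB_BIT   (1ULL << 54)  /* default operand size */
#define DESC_G_BIT    (1ULL << 55)  /* limit counted in 4KiB pages */

/* Hypothetical helper: effective limit in bytes, honouring the G bit. */
static uint32_t desc_limit_bytes(uint64_t desc)
{
    /* Limit bits 0-15 live in descriptor bits 0-15, bits 16-19 in 48-51. */
    uint32_t limit = (uint32_t)(desc & 0xffff) |
                     (uint32_t)((desc >> 32) & 0xf0000);

    return (desc & DESC_G_BIT) ? (limit << 12) | 0xfff : limit;
}

/* Hypothetical helper: is this a present 64-bit code descriptor? */
static bool desc_is_64bit_code(uint64_t desc)
{
    return (desc & DESC_P_BIT) && (desc & DESC_S_BIT) &&
           (desc & DESC_EXEC_BIT) && (desc & DESC_L_BIT) &&
           !(desc & DESC_DB_BIT);   /* D must be clear when L is set */
}
```

The same 20-bit limit field therefore means either bytes or 4KiB pages depending on one flag, which is why a bare base/limit pair cannot be interpreted without the descriptor it came from.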
Mukesh Rathor
2013-Aug-13 00:28 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Mon, 12 Aug 2013 14:24:32 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 12.08.13 at 13:53, Tim Deegan <tim@xen.org> wrote:
> > At 12:04 +0100 on 12 Aug (1376309065), Jan Beulich wrote:
> >> >>> On 12.08.13 at 12:24, Tim Deegan <tim@xen.org> wrote:
> >> > At 08:54 +0100 on 12 Aug (1376297674), Jan Beulich wrote:
> >> >> >>> On 10.08.13 at 01:41, Mukesh Rathor
........
> >
> > Ah, you're right, that's explicitly disallowed. :( Xen could set
> > any non-null descriptor (still requiring the caller to specify null
> > descriptors and documenting that the descriptor registers will be
> > undefined until the guest loads them on the new vcpu).
>
> Hmm, yes, I don't really like starting a guest with inconsistent
> state, but it's an option.
>
> > If we really require the selectors to match the GDT contents then we
> > either have to constrain the selector/GDT contents (a horrible
> > interface) or properly emulate the loads (which surely we already
> > have code in Xen to do).
>
> protmode_load_seg() in the x86 emulation code appears to be the
> only one. But it doesn't seem impossible to leverage this here.

hvm_load_segment_selector() will do it.

Mukesh
Tim Deegan
2013-Aug-13 08:51 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
At 17:02 -0700 on 12 Aug (1376326969), Mukesh Rathor wrote:
> On Mon, 12 Aug 2013 10:00:36 +0100 Tim Deegan <tim@xen.org> wrote:
> > At 16:41 -0700 on 09 Aug (1376066498), Mukesh Rathor wrote:
> > > Ugh, to find the G bit, I need to walk the GDT to find the LDT
> > > descriptor.
> >
> > Why so? The caller supplies you with the LDT base and range, not a
> > segment selector. I don't think you could find the right LDT selector
> > by scanning the GDT anyway -- what if there were two that matched?
>
> Because the range is interpreted in bytes or pages depending on the G
> bit in the descriptor.

The range is specified in _entries_. Unless you intend to change the
interface, in which case, you could just ask for the descriptor.

Tim.
Jan Beulich
2013-Aug-13 10:49 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
>>> On 13.08.13 at 02:27, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 12 Aug 2013 08:54:34 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> It seems so unlikely that any guest would *require* LDT table set on
> boot, that I'll just check for NULL like you said above. We can always
> enhance in future. CS->flat is an option, but it would require special
> code in linux. It would be nice to keep that code same for both PV and
> PVH in linux.

Why would using a flat 64-bit CS require special code in Linux? It's
already running on a flat 64-bit CS...

>> Talking of the LDT selector: Iirc you modify struct
>> vcpu_guest_context's GDT to match PVH needs, but if I'm not
>> mistaken you don't do the same for the LDT - PVH would require
>> merely a selector here, not a base/ents pair.
>
> I think that's way overkill, may make sense for PV, but PVH can
> just do it first thing in its boot code.

If you think so, then you're not clear of the implications, including
that this - being part of the hypervisor/guest interface - needs to
be fixed _now_ rather than later. What I think you neglect here is
that struct vcpu_guest_context is also used for save/restore, and
there you _need_ to properly represent the LDT. So you have two
choices: Keep the structure the way it is and require the guest to
do LDT management the PV way (which will be very hard, as you
have no spare selectors to use for it), or do things the HVM way
(having the guest use the LDT instruction, implying that it's the
selector and nothing else that needs to be saved/restored).

If you go that latter, more natural route, then you next need to
(immediately) decide whether the extra context will also be save/
restored HVM-like (via struct hvm_hw_cpu), in which case the LDT
related fields mentioned above are just unused on the PVH case
(and you'd want to validate that they're zeroed).

Jan
Mukesh Rathor
2013-Aug-14 01:13 UTC
Re: [V10 PATCH 11/23] PVH xen: support invalid op emulation for PVH
On Mon, 12 Aug 2013 08:38:34 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 10.08.13 at 04:13, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 8 Aug 2013 09:55:26 +0100
> > George Dunlap <george.dunlap@eu.citrix.com> wrote:
> >
> >> On 08/08/13 02:49, Mukesh Rathor wrote:
> >> > On Wed, 7 Aug 2013 12:29:13 +0100
> >> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >> >
.......
> For what it's worth - I had pointed out the non-logical breakup of
> the series as an issue quite early in the process, and merely gave
> in realizing that your life with this is already difficult enough.

True, and as soon as I realized that patches had to be logically
organized, and not by file as I was used to, I completely reorged all
the patches to have them logically broken up as best as possible.

thanks
mukesh
Mukesh Rathor
2013-Aug-14 02:21 UTC
Re: [V10 PATCH 09/23] PVH xen: introduce pvh_set_vcpu_info() and vmx_pvh_set_vcpu_info()
On Tue, 13 Aug 2013 11:49:54 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 13.08.13 at 02:27, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Mon, 12 Aug 2013 08:54:34 +0100 "Jan Beulich"
..
> If you think so, then you're not clear of the implications, including
> that this - being part of the hypervisor/guest interface - needs to
> be fixed _now_ rather than later. What I think you neglect here is
> that struct vcpu_guest_context is also used for save/restore, and
> there you _need_ to properly represent the LDT. So you have two
> choices: Keep the structure the way it is and require the guest to
> do LDT management the PV way (which will be very hard, as you
> have no spare selectors to use for it), or do things the HVM way
> (having the guest use the LDT instruction, implying that it's the
> selector and nothing else that needs to be saved/restored).
>
> If you go that latter, more natural route, then you next need to
> (immediately) decide whether the extra context will also be save/
> restored HVM-like (via struct hvm_hw_cpu), in which case the LDT
> related fields mentioned above are just unused on the PVH case
> (and you'd want to validate that they're zeroed).

I completely intend for save/restore to be the HVM way, that would make
most sense. This brings back to patch version 0, LDT related fields are
unused on PVH :).

thanks,
mukesh
George Dunlap
2013-Aug-15 15:51 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Wed, Aug 7, 2013 at 10:54 AM, Tim Deegan <tim@xen.org> wrote:> At 17:37 -0700 on 06 Aug (1375810631), Mukesh Rathor wrote: >> On Fri, 26 Jul 2013 11:45:19 +0100 >> Tim Deegan <tim@xen.org> wrote: >> >> > At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote: >> > > On Thu, 25 Jul 2013 17:28:40 +0100 >> > > Tim Deegan <tim@xen.org> wrote: >> > > >> > > > At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote: >> > > > > +/* Just like HVM, PVH should be using "cpuid" from the kernel >> > > > > mode. */ +static int vmxit_invalid_op(struct cpu_user_regs >> > > > > *regs) +{ >> > > > > + if ( guest_kernel_mode(current, regs) >> > > > > || !emulate_forced_invalid_op(regs) ) >> > > > > + hvm_inject_hw_exception(TRAP_invalid_op, >> > > > > HVM_DELIVER_NO_ERROR_CODE); >> > > > >> > > > Was this discussed before? It seems harsh to stop kernel-mode >> > > > code from using the pv cpuid operation if it wants to. In >> > > > particular, what about loadable kernel modules? >> > > >> > > Yes, few times on the xen mailing list. The only PVH guest, linux >> > > as of now, the pv ops got rewired to use native cpuid, which is >> > > how hvm does it. >> > >> > Yes, but presumably you want to make it easy for other PV guests to >> > port to PVH too? >> >> True, but how would not allowing kernel mode emulation impede that? >> I fail to understand why a new kernel would wanna use xen signature >> emulation over just plain cpuid instruction? > > I''m talking about existing PV kernel code that already uses PV CPUID. > And in particular what if that kernel code is in a device driver, or > even a third-party loadable module? Porting the core kernel from PV to > PVH shouldn''t break that code if it doesn''t have to. > > But TBH my objection is really more aesthetic than anything else. > Restricting the PV CPUID instruction here adds another ragged edge in > the ABI that all kernel writers have to think about, and for little or > no benfit. 
And, to be clear, I object to it and this patch does not
> have my Ack.

Are you in particular saying that you think PVH guests should be
allowed to use the PV CPUID as a prerequisite for the patch series to
go in?

Correct me if I'm wrong, but:
1. CPUID is the only forced_emulated op at the moment
2. The kernel is the only caller we can reasonably expect to use the
   forced_emulated ops

If so, then if we *were* to go Mukesh's way on this, we should just
forget the whole forced_emulated op thing anyway.

It seems to me that one of the advantages of using HVM containers is
not needing to do any tricks like this. I'm inclined to say we should
not do any forced_emulated ops at all to begin with; and to add
support if we actually find that there are callers that need it.

-George
Tim Deegan
2013-Aug-15 16:37 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
At 16:51 +0100 on 15 Aug (1376585469), George Dunlap wrote:> On Wed, Aug 7, 2013 at 10:54 AM, Tim Deegan <tim@xen.org> wrote: > > At 17:37 -0700 on 06 Aug (1375810631), Mukesh Rathor wrote: > >> On Fri, 26 Jul 2013 11:45:19 +0100 > >> Tim Deegan <tim@xen.org> wrote: > >> > >> > At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote: > >> > > On Thu, 25 Jul 2013 17:28:40 +0100 > >> > > Tim Deegan <tim@xen.org> wrote: > >> > > > >> > > > At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote: > >> > > > > +/* Just like HVM, PVH should be using "cpuid" from the kernel > >> > > > > mode. */ +static int vmxit_invalid_op(struct cpu_user_regs > >> > > > > *regs) +{ > >> > > > > + if ( guest_kernel_mode(current, regs) > >> > > > > || !emulate_forced_invalid_op(regs) ) > >> > > > > + hvm_inject_hw_exception(TRAP_invalid_op, > >> > > > > HVM_DELIVER_NO_ERROR_CODE); > >> > > > > >> > > > Was this discussed before? It seems harsh to stop kernel-mode > >> > > > code from using the pv cpuid operation if it wants to. In > >> > > > particular, what about loadable kernel modules? > >> > > > >> > > Yes, few times on the xen mailing list. The only PVH guest, linux > >> > > as of now, the pv ops got rewired to use native cpuid, which is > >> > > how hvm does it. > >> > > >> > Yes, but presumably you want to make it easy for other PV guests to > >> > port to PVH too? > >> > >> True, but how would not allowing kernel mode emulation impede that? > >> I fail to understand why a new kernel would wanna use xen signature > >> emulation over just plain cpuid instruction? > > > > I''m talking about existing PV kernel code that already uses PV CPUID. > > And in particular what if that kernel code is in a device driver, or > > even a third-party loadable module? Porting the core kernel from PV to > > PVH shouldn''t break that code if it doesn''t have to. > > > > But TBH my objection is really more aesthetic than anything else. 
> > Restricting the PV CPUID instruction here adds another ragged edge in
> > the ABI that all kernel writers have to think about, and for little or
> > no benefit. And, to be clear, I object to it and this patch does not
> > have my Ack.
>
> Are you in particular saying that you think PVH guests should be
> allowed to use the PV CPUID as a prerequisite for the patch series to
> go in?
>
> Correct me if I'm wrong, but:
> 1. CPUID is the only forced_emulated op at the moment
> 2. The kernel is the only caller we can reasonably expect to use the
> forced_emulated ops

No - the forced-invalid-op CPUID is used from userspace too, e.g. in
tools/misc/xen-detect.c. The current patch keeps support for userspace
but not for kernel users. I would prefer to keep it for both, on the
grounds that even in the kernel there may be users of it that the person
porting the kernel does not control (e.g. in third-party modules).

Tim.

> If so, then if we *were* to go Mukesh's way on this, we should just
> forget the whole forced_emulated op thing anyway.
>
> It seems to me that one of the advantages of using HVM containers is
> not needing to do any tricks like this. I'm inclined to say we should
> not do any forced_emulated ops at all to begin with; and to add
> support if we actually find that there are callers that need it.
>
> -George
George Dunlap
2013-Aug-15 16:44 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On 15/08/13 17:37, Tim Deegan wrote:> At 16:51 +0100 on 15 Aug (1376585469), George Dunlap wrote: >> On Wed, Aug 7, 2013 at 10:54 AM, Tim Deegan <tim@xen.org> wrote: >>> At 17:37 -0700 on 06 Aug (1375810631), Mukesh Rathor wrote: >>>> On Fri, 26 Jul 2013 11:45:19 +0100 >>>> Tim Deegan <tim@xen.org> wrote: >>>> >>>>> At 19:30 -0700 on 25 Jul (1374780657), Mukesh Rathor wrote: >>>>>> On Thu, 25 Jul 2013 17:28:40 +0100 >>>>>> Tim Deegan <tim@xen.org> wrote: >>>>>> >>>>>>> At 18:59 -0700 on 23 Jul (1374605971), Mukesh Rathor wrote: >>>>>>>> +/* Just like HVM, PVH should be using "cpuid" from the kernel >>>>>>>> mode. */ +static int vmxit_invalid_op(struct cpu_user_regs >>>>>>>> *regs) +{ >>>>>>>> + if ( guest_kernel_mode(current, regs) >>>>>>>> || !emulate_forced_invalid_op(regs) ) >>>>>>>> + hvm_inject_hw_exception(TRAP_invalid_op, >>>>>>>> HVM_DELIVER_NO_ERROR_CODE); >>>>>>> Was this discussed before? It seems harsh to stop kernel-mode >>>>>>> code from using the pv cpuid operation if it wants to. In >>>>>>> particular, what about loadable kernel modules? >>>>>> Yes, few times on the xen mailing list. The only PVH guest, linux >>>>>> as of now, the pv ops got rewired to use native cpuid, which is >>>>>> how hvm does it. >>>>> Yes, but presumably you want to make it easy for other PV guests to >>>>> port to PVH too? >>>> True, but how would not allowing kernel mode emulation impede that? >>>> I fail to understand why a new kernel would wanna use xen signature >>>> emulation over just plain cpuid instruction? >>> I''m talking about existing PV kernel code that already uses PV CPUID. >>> And in particular what if that kernel code is in a device driver, or >>> even a third-party loadable module? Porting the core kernel from PV to >>> PVH shouldn''t break that code if it doesn''t have to. >>> >>> But TBH my objection is really more aesthetic than anything else. 
>>> Restricting the PV CPUID instruction here adds another ragged edge in
>>> the ABI that all kernel writers have to think about, and for little or
>>> no benefit. And, to be clear, I object to it and this patch does not
>>> have my Ack.
>> Are you in particular saying that you think PVH guests should be
>> allowed to use the PV CPUID as a prerequisite for the patch series to
>> go in?
>>
>> Correct me if I'm wrong, but:
>> 1. CPUID is the only forced_emulated op at the moment
>> 2. The kernel is the only caller we can reasonably expect to use the
>> forced_emulated ops
> No - the forced-invalid-op CPUID is used from userspace too, e.g. in
> tools/misc/xen-detect.c. The current patch keeps support for userspace
> but not for kernel users. I would prefer to keep it for both, on the
> grounds that even in the kernel there may be users of it that the person
> porting the kernel does not control (e.g. in third-party modules).

Ah, right. That makes sense then -- we'd want to have the same toolstack
binaries for PV and PVH.

But then if we're going to have it, I agree that there's no reason not
to just go ahead and make it available in all modes. We can just
deprecate it in kernel mode.

-George
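To summarise the two behaviours debated in this sub-thread, the #UD handling can be modelled as a pair of predicates. This is only a sketch under stated assumptions: both function names are hypothetical, and `emulated` stands for "emulate_forced_invalid_op() would recognise the prefix and succeed", whereas in the real patch the emulation is not even attempted in kernel mode because of short-circuit evaluation.

```c
#include <stdbool.h>

/*
 * Behaviour of the V10 patch as quoted above: a kernel-mode #UD is always
 * reflected to the guest; a user-mode #UD is reflected only when forced
 * emulation does not apply.
 */
static bool inject_ud_v10(bool kernel_mode, bool emulated)
{
    return kernel_mode || !emulated;
}

/*
 * Behaviour argued for by Tim and accepted by George: attempt the forced
 * emulation in every mode and reflect #UD only when it does not apply.
 */
static bool inject_ud_all_modes(bool kernel_mode, bool emulated)
{
    (void)kernel_mode;  /* the guest's mode no longer matters */
    return !emulated;
}
```

The two predicates differ in exactly one case: a kernel-mode caller (for example a third-party module) using the forced-emulation CPUID gets #UD under the first policy and a successful emulation under the second.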
George Dunlap
2013-Aug-16 10:18 UTC
Re: [V10 PATCH 01/23] PVH xen: Add readme docs/misc/pvh-readme.txt
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
> docs/misc/pvh-readme.txt | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 56 insertions(+), 0 deletions(-)
> create mode 100644 docs/misc/pvh-readme.txt
>
> diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt
> new file mode 100644
> index 0000000..3b14aa7
> --- /dev/null
> +++ b/docs/misc/pvh-readme.txt
> @@ -0,0 +1,56 @@
> +
> +PVH : an x86 PV guest running in an HVM container. HAP is required for PVH.
> +
> +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/
> +
> +At present the only PVH guest is an x86 64bit PV linux. Patches are at:
> + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git

One thing that's missing from this is how to actually create a PVH
guest. I'm guessing there's a way to get the xl toolstack to create a
PVH domain with HAP enabled?

-George
George Dunlap
2013-Aug-16 13:17 UTC
Re: [V10 PATCH 01/23] PVH xen: Add readme docs/misc/pvh-readme.txt
On Fri, Aug 16, 2013 at 11:18 AM, George Dunlap <dunlapg@umich.edu> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>> ---
>> docs/misc/pvh-readme.txt | 56 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 56 insertions(+), 0 deletions(-)
>> create mode 100644 docs/misc/pvh-readme.txt
>>
>> diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt
>> new file mode 100644
>> index 0000000..3b14aa7
>> --- /dev/null
>> +++ b/docs/misc/pvh-readme.txt
>> @@ -0,0 +1,56 @@
>> +
>> +PVH : an x86 PV guest running in an HVM container. HAP is required for PVH.
>> +
>> +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/
>> +
>> +At present the only PVH guest is an x86 64bit PV linux. Patches are at:
>> + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>
> One thing that's missing from this is how to actually create a PVH
> guest. I'm guessing there's a way to get the xl toolstack to create a
> PVH domain with HAP enabled?

Also, which branch has the most recent PVH kernel to test with V10 of
the hypervisor series?

-George
Konrad Rzeszutek Wilk
2013-Aug-16 14:11 UTC
Re: [V10 PATCH 01/23] PVH xen: Add readme docs/misc/pvh-readme.txt
On Fri, Aug 16, 2013 at 02:17:47PM +0100, George Dunlap wrote:
> On Fri, Aug 16, 2013 at 11:18 AM, George Dunlap <dunlapg@umich.edu> wrote:
> > On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> >> ---
> >> docs/misc/pvh-readme.txt | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 files changed, 56 insertions(+), 0 deletions(-)
> >> create mode 100644 docs/misc/pvh-readme.txt
> >>
> >> diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt
> >> new file mode 100644
> >> index 0000000..3b14aa7
> >> --- /dev/null
> >> +++ b/docs/misc/pvh-readme.txt
> >> @@ -0,0 +1,56 @@
> >> +
> >> +PVH : an x86 PV guest running in an HVM container. HAP is required for PVH.
> >> +
> >> +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/
> >> +
> >> +At present the only PVH guest is an x86 64bit PV linux. Patches are at:
> >> + git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> >
> > One thing that's missing from this is how to actually create a PVH
> > guest. I'm guessing there's a way to get the xl toolstack to create a
> > PVH domain with HAP enabled?
>
> Also, which branch has the most recent PVH kernel to test with V10 of
> the hypervisor series?

stable/pvh.v8. But you need to merge it against the latest. I would
recommend you do:

    git checkout v3.9
    git merge stable/pvh.v8

There are some conflicts, but they are pretty simple to resolve.

>
> -George
George Dunlap
2013-Aug-16 15:32 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch mostly contains changes to arch/x86/domain.c to allow for a PVH
> domain creation. The new function pvh_set_vcpu_info(), introduced in the
> previous patch, is called here to set some guest context in the VMCS.
> This patch also changes the context_switch code in the same file to follow
> HVM behaviour for PVH.
>
> Changes in V2:
>   - changes to read_segment_register() moved to this patch.
>   - The other comment was to create NULL functions for pvh_set_vcpu_info
>     and pvh_read_descriptor which are implemented in later patch, but since
>     I disable PVH creation until all patches are checked in, it is not needed.
>     But it helps breaking down of patches.
>
> Changes in V3:
>   - Fix read_segment_register() macro to make sure args are evaluated once,
>     and use # instead of STR for name in the macro.
>
> Changes in V4:
>   - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been
>     moved out of pv_vcpu struct.
>   - rename hvm_pvh_* functions to hvm_*.
>
> Changes in V5:
>   - remove pvh_read_descriptor().
>
> Changes in V7:
>   - remove hap_update_cr3() and read_segment_register changes from here.
>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
>  xen/arch/x86/domain.c | 56 ++++++++++++++++++++++++++++++++----------------
>  xen/arch/x86/mm.c     |  3 ++
>  2 files changed, 40 insertions(+), 19 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index c361abf..fccb4ee 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)
>
>      vmce_init_vcpu(v);
>
> -    if ( is_hvm_domain(d) )
> +    if ( !is_pv_domain(d) )
>      {
>          rc = hvm_vcpu_initialise(v);
>          goto done;
> @@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)
>
>      vcpu_destroy_fpu(v);
>
> -    if ( is_hvm_vcpu(v) )
> +    if ( !is_pv_vcpu(v) )
>          hvm_vcpu_destroy(v);
>      else
>          xfree(v->arch.pv_vcpu.trap_ctxt);
> @@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      int rc = -ENOMEM;
>
>      d->arch.hvm_domain.hap_enabled =
> -        is_hvm_domain(d) &&
> +        !is_pv_domain(d) &&
>          hvm_funcs.hap_supported &&
>          (domcr_flags & DOMCRF_hap);
>      d->arch.hvm_domain.mem_sharing_enabled = 0;
> @@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      mapcache_domain_init(d);
>
>      HYPERVISOR_COMPAT_VIRT_START(d) =
> -        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
> +        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
>
>      if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
>          goto fail;
> @@ -555,7 +555,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>      }
>      spin_lock_init(&d->arch.e820_lock);
>
> -    if ( is_hvm_domain(d) )
> +    if ( !is_pv_domain(d) )
>      {
>          if ( (rc = hvm_domain_initialise(d)) != 0 )
>          {
> @@ -651,7 +651,7 @@ int arch_set_info_guest(
>  #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
>      flags = c(flags);
>
> -    if ( !is_hvm_vcpu(v) )
> +    if ( is_pv_vcpu(v) )
>      {
>          if ( !compat )
>          {
> @@ -704,7 +704,7 @@ int arch_set_info_guest(
>      v->fpu_initialised = !!(flags & VGCF_I387_VALID);
>
>      v->arch.flags &= ~TF_kernel_mode;
> -    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
> +    if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ )
>          v->arch.flags |= TF_kernel_mode;
>
>      v->arch.vgc_flags = flags;
> @@ -719,7 +719,7 @@ int arch_set_info_guest(
>      if ( !compat )
>      {
>          memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
> -        if ( !is_hvm_vcpu(v) )
> +        if ( is_pv_vcpu(v) )
>              memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
>                     sizeof(c.nat->trap_ctxt));
>      }
> @@ -735,10 +735,13 @@ int arch_set_info_guest(
>
>      v->arch.user_regs.eflags |= 2;
>
> -    if ( is_hvm_vcpu(v) )
> +    if ( !is_pv_vcpu(v) )
>      {
>          hvm_set_info_guest(v);
> -        goto out;
> +        if ( is_hvm_vcpu(v) || v->is_initialised )
> +            goto out;
> +        else
> +            goto pvh_skip_pv_stuff;
>      }
>
>      init_int80_direct_trap(v);
> @@ -853,6 +856,7 @@ int arch_set_info_guest(
>
>      set_bit(_VPF_in_reset, &v->pause_flags);
>
> + pvh_skip_pv_stuff:

Any idea what this set_bit(_VPF_in_reset) stuff is? It looks like it's set above, and cleared down near the bottom of the function if nothing gets screwed up.

It seems like if that set/clear pair is important, then PVH should do them both as well, shouldn't it?

-George
Jan Beulich
2013-Aug-16 16:11 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
>>> On 16.08.13 at 17:32, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> This patch mostly contains changes to arch/x86/domain.c to allow for a PVH
>> domain creation. The new function pvh_set_vcpu_info(), introduced in the
>> previous patch, is called here to set some guest context in the VMCS.
>> This patch also changes the context_switch code in the same file to follow
>> HVM behaviour for PVH.
>>
>> Changes in V2:
>>   - changes to read_segment_register() moved to this patch.
>>   - The other comment was to create NULL functions for pvh_set_vcpu_info
>>     and pvh_read_descriptor which are implemented in later patch, but since
>>     I disable PVH creation until all patches are checked in, it is not needed.
>>     But it helps breaking down of patches.
>>
>> Changes in V3:
>>   - Fix read_segment_register() macro to make sure args are evaluated once,
>>     and use # instead of STR for name in the macro.
>>
>> Changes in V4:
>>   - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been
>>     moved out of pv_vcpu struct.
>>   - rename hvm_pvh_* functions to hvm_*.
>>
>> Changes in V5:
>>   - remove pvh_read_descriptor().
>>
>> Changes in V7:
>>   - remove hap_update_cr3() and read_segment_register changes from here.
>>
>> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> ---
>>  xen/arch/x86/domain.c | 56 ++++++++++++++++++++++++++++++++----------------
>>  xen/arch/x86/mm.c     |  3 ++
>>  2 files changed, 40 insertions(+), 19 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index c361abf..fccb4ee 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)
>>
>>      vmce_init_vcpu(v);
>>
>> -    if ( is_hvm_domain(d) )
>> +    if ( !is_pv_domain(d) )
>>      {
>>          rc = hvm_vcpu_initialise(v);
>>          goto done;
>> @@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)
>>
>>      vcpu_destroy_fpu(v);
>>
>> -    if ( is_hvm_vcpu(v) )
>> +    if ( !is_pv_vcpu(v) )
>>          hvm_vcpu_destroy(v);
>>      else
>>          xfree(v->arch.pv_vcpu.trap_ctxt);
>> @@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>>      int rc = -ENOMEM;
>>
>>      d->arch.hvm_domain.hap_enabled =
>> -        is_hvm_domain(d) &&
>> +        !is_pv_domain(d) &&
>>          hvm_funcs.hap_supported &&
>>          (domcr_flags & DOMCRF_hap);
>>      d->arch.hvm_domain.mem_sharing_enabled = 0;
>> @@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>>      mapcache_domain_init(d);
>>
>>      HYPERVISOR_COMPAT_VIRT_START(d) =
>> -        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
>> +        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
>>
>>      if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
>>          goto fail;
>> @@ -555,7 +555,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
>>      }
>>      spin_lock_init(&d->arch.e820_lock);
>>
>> -    if ( is_hvm_domain(d) )
>> +    if ( !is_pv_domain(d) )
>>      {
>>          if ( (rc = hvm_domain_initialise(d)) != 0 )
>>          {
>> @@ -651,7 +651,7 @@ int arch_set_info_guest(
>>  #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
>>      flags = c(flags);
>>
>> -    if ( !is_hvm_vcpu(v) )
>> +    if ( is_pv_vcpu(v) )
>>      {
>>          if ( !compat )
>>          {
>> @@ -704,7 +704,7 @@ int arch_set_info_guest(
>>      v->fpu_initialised = !!(flags & VGCF_I387_VALID);
>>
>>      v->arch.flags &= ~TF_kernel_mode;
>> -    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
>> +    if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ )
>>          v->arch.flags |= TF_kernel_mode;
>>
>>      v->arch.vgc_flags = flags;
>> @@ -719,7 +719,7 @@ int arch_set_info_guest(
>>      if ( !compat )
>>      {
>>          memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
>> -        if ( !is_hvm_vcpu(v) )
>> +        if ( is_pv_vcpu(v) )
>>              memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
>>                     sizeof(c.nat->trap_ctxt));
>>      }
>> @@ -735,10 +735,13 @@ int arch_set_info_guest(
>>
>>      v->arch.user_regs.eflags |= 2;
>>
>> -    if ( is_hvm_vcpu(v) )
>> +    if ( !is_pv_vcpu(v) )
>>      {
>>          hvm_set_info_guest(v);
>> -        goto out;
>> +        if ( is_hvm_vcpu(v) || v->is_initialised )
>> +            goto out;
>> +        else
>> +            goto pvh_skip_pv_stuff;
>>      }
>>
>>      init_int80_direct_trap(v);
>> @@ -853,6 +856,7 @@ int arch_set_info_guest(
>>
>>      set_bit(_VPF_in_reset, &v->pause_flags);
>>
>> + pvh_skip_pv_stuff:
>
> Any idea what this set_bit(_VPF_in_reset) stuff is? It looks like
> it's set above, and cleared down near the bottom of the function if
> nothing gets screwed up.

This is related to the preemptible vCPU reset (which arch_set_info_guest() just re-uses), making sure that while there is an incomplete state update for a vCPU (because it may have got preempted) the vCPU can't be unpaused.

> It seems like if that set/clear pair is important, then PVH should do
> them both as well, shouldn't it?

I thought I had checked this once - does it now bypass one of the two?

But then again, this is all about PV memory management, so perhaps it was that way when I checked, and I decided it was fine.

Jan
Mukesh Rathor
2013-Aug-16 21:39 UTC
Re: [V10 PATCH 01/23] PVH xen: Add readme docs/misc/pvh-readme.txt
On Fri, 16 Aug 2013 11:18:51 +0100 George Dunlap <dunlapg@umich.edu> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> > ---
> >  docs/misc/pvh-readme.txt | 56 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 56 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/misc/pvh-readme.txt
> >
> > diff --git a/docs/misc/pvh-readme.txt b/docs/misc/pvh-readme.txt
> > new file mode 100644
> > index 0000000..3b14aa7
> > --- /dev/null
> > +++ b/docs/misc/pvh-readme.txt
> > @@ -0,0 +1,56 @@
> > +
> > +PVH : an x86 PV guest running in an HVM container. HAP is required for PVH.
> > +
> > +See: http://blog.xen.org/index.php/2012/10/23/the-paravirtualization-spectrum-part-1-the-ends-of-the-spectrum/
> > +
> > +At present the only PVH guest is an x86 64bit PV linux. Patches are at:
> > +   git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>
> One thing that's missing from this is how to actually create a PVH
> guest. I'm guessing there's a way to get the xl toolstack to create a
> PVH domain with HAP enabled?

Further down:

"The initial phase targets the booting of a 64bit UP/SMP linux guest in PVH mode. This is done by adding: pvh=1 in the config file. xl, and not xm, is.."

So, just set pvh=1 and hap=1 in the config file. That will do it.

Mukesh
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
>          hvm_asid_flush_vcpu(v);
>      }
>
> +    if ( is_pvh_vcpu(v) )
> +        reset_stack_and_jump(vmx_asm_do_vmentry);
> +

This skips the debugger stuff, but also skips hvm_do_resume(). hvm_do_resume() has timer and ioreq stuff that's not needed for PVH, but it also has code to "Inject a pending hw/sw trap". Might that code not be needed?

-George
On Mon, Aug 19, 2013 at 5:00 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
>>          hvm_asid_flush_vcpu(v);
>>      }
>>
>> +    if ( is_pvh_vcpu(v) )
>> +        reset_stack_and_jump(vmx_asm_do_vmentry);
>> +
>
> This skips the debugger stuff, but also skips hvm_do_resume().
> hvm_do_resume() has timer and ioreq stuff that's not needed for PVH,
> but it also has code to "Inject a pending hw/sw trap". Might that
> code not be needed?

hvm_do_resume() also has check_wakeup_from_wait() -- that seems like it would apply to PVH VMs as well, no?

-George
On Mon, 19 Aug 2013 17:00:53 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
> >          hvm_asid_flush_vcpu(v);
> >      }
> >
> > +    if ( is_pvh_vcpu(v) )
> > +        reset_stack_and_jump(vmx_asm_do_vmentry);
> > +
>
> This skips the debugger stuff, but also skips hvm_do_resume().
> hvm_do_resume() has timer and ioreq stuff that's not needed for PVH,
> but it also has code to "Inject a pending hw/sw trap". Might that
> code not be needed?

We inject exceptions directly for PVH, like in vmxit_int3() if the exception doesn't belong to supported debugger, we inject it into the guest. PVH supports gdbsx debugger, but not the external debugger which seems to set the vector to be injected (for HVMOP_inject_trap).

-Mukesh
On Mon, 19 Aug 2013 17:03:11 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Aug 19, 2013 at 5:00 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
> > On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> > <mukesh.rathor@oracle.com> wrote:
> >> @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
> >>          hvm_asid_flush_vcpu(v);
> >>      }
> >>
> >> +    if ( is_pvh_vcpu(v) )
> >> +        reset_stack_and_jump(vmx_asm_do_vmentry);
> >> +
> >
> > This skips the debugger stuff, but also skips hvm_do_resume().
> > hvm_do_resume() has timer and ioreq stuff that's not needed for PVH,
> > but it also has code to "Inject a pending hw/sw trap". Might that
> > code not be needed?
>
> hvm_do_resume() also has check_wakeup_from_wait() -- that seems like
> it would apply to PVH VMs as well, no?

Right, to support VMI (virtual machine introspection) we'd need that. I should add that to the list of things TBD for PVH. It may be as simple as just doing that for PVH, but I'd need to study VMI in a bit more detail.

thanks,
mukesh
Mukesh Rathor
2013-Aug-20 00:52 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Fri, 16 Aug 2013 17:11:21 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 16.08.13 at 17:32, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
......
> >>
> >>      set_bit(_VPF_in_reset, &v->pause_flags);
> >>
> >> + pvh_skip_pv_stuff:
> >
> > Any idea what this set_bit(_VPF_in_reset) stuff is? It looks like
> > it's set above, and cleared down near the bottom of the function if
> > nothing gets screwed up.
>
> This is related to the preemptible vCPU reset (which
> arch_set_info_guest() just re-uses), making sure that while there
> is an incomplete state update for a vCPU (because it may have got
> preempted) the vCPU can't be unpaused.
>
> > It seems like if that set/clear pair is important, then PVH should
> > do them both as well, shouldn't it?
>
> I thought I had checked this once - does it now bypass one of the
> two?
>
> But then again, this is all about PV memory management, so perhaps
> it was that way when I checked, and I decided it was fine.

Ok, I'll just leave it as it is then. Setting it might leave someone wondering why it's being set for PVH. Clearing is harmless anyways :).

thanks
mukesh
George Dunlap
2013-Aug-20 09:29 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Tue, Aug 20, 2013 at 1:52 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 16 Aug 2013 17:11:21 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 16.08.13 at 17:32, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> > On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> ......
>> >>
>> >>      set_bit(_VPF_in_reset, &v->pause_flags);
>> >>
>> >> + pvh_skip_pv_stuff:
>> >
>> > Any idea what this set_bit(_VPF_in_reset) stuff is? It looks like
>> > it's set above, and cleared down near the bottom of the function if
>> > nothing gets screwed up.
>>
>> This is related to the preemptible vCPU reset (which
>> arch_set_info_guest() just re-uses), making sure that while there
>> is an incomplete state update for a vCPU (because it may have got
>> preempted) the vCPU can't be unpaused.
>>
>> > It seems like if that set/clear pair is important, then PVH should
>> > do them both as well, shouldn't it?
>>
>> I thought I had checked this once - does it now bypass one of the
>> two?
>>
>> But then again, this is all about PV memory management, so perhaps
>> it was that way when I checked, and I decided it was fine.
>
> Ok, I'll just leave it as it is then. Setting it might confuse someone
> why it's being set for PVH. Clearing is harmless anyways :).

I think skipping the set is much more confusing. Part of the reason I was asking is that I was looking at re-organizing the function so that all the stuff common to PVH and PV was at the top; then, instead of the goto, you would just have two returns, one for HVM and one for PVH, at the appropriate places. I backed off and asked when I saw this.

-George
George Dunlap
2013-Aug-20 14:13 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index bff05d9..19a085c 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
>  const struct paging_mode *
>  hap_paging_get_mode(struct vcpu *v)
>  {
> -    return !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
> +    /* PVH 32bitfixme. */
> +    return is_pvh_vcpu(v)        ? &hap_paging_long_mode :
> +        !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
>          hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
>          hvm_pae_enabled(v)       ? &hap_paging_pae_mode  :
>                                     &hap_paging_protected_mode;

This shouldn't be necessary, right? The PVH code should ensure that for 64-bit PVH guests, hvm_long_mode_enabled() is always true, right?

-George
On 08/19/2013 11:21 PM, Mukesh Rathor wrote:
> On Mon, 19 Aug 2013 17:00:53 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>>> @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
>>>          hvm_asid_flush_vcpu(v);
>>>      }
>>>
>>> +    if ( is_pvh_vcpu(v) )
>>> +        reset_stack_and_jump(vmx_asm_do_vmentry);
>>> +
>>
>> This skips the debugger stuff, but also skips hvm_do_resume().
>> hvm_do_resume() has timer and ioreq stuff that's not needed for PVH,
>> but it also has code to "Inject a pending hw/sw trap". Might that
>> code not be needed?
>
> We inject exceptions directly for PVH, like in vmxit_int3() if the
> exception doesn't belong to supported debugger, we inject it into the
> guest. PVH supports gdbsx debugger, but not the external debugger which
> seems to set the vector to be injected (for HVMOP_inject_trap).

But the HVM VMX code does exactly the same thing for int3. When is this code triggered in the HVM case, and why is it not necessary for the PVH case?

-George
Mukesh Rathor
2013-Aug-20 21:32 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On Tue, 20 Aug 2013 15:13:10 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> > index bff05d9..19a085c 100644
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
> >  const struct paging_mode *
> >  hap_paging_get_mode(struct vcpu *v)
> >  {
> > -    return !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
> > +    /* PVH 32bitfixme. */
> > +    return is_pvh_vcpu(v)        ? &hap_paging_long_mode :
> > +        !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
> >          hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
> >          hvm_pae_enabled(v)       ? &hap_paging_pae_mode  :
> >                                     &hap_paging_protected_mode;
>
> This shouldn't be necessary, right? The PVH code should ensure that
> for 64-bit PVH guests, hvm_long_mode_enabled() is always true, right?

Right, 64bit PVH always will be in long mode. However, with 32bit PVH, this check will change, so best to leave it here.

thanks
mukesh
Mukesh Rathor
2013-Aug-20 21:41 UTC
Re: [V10 PATCH 18/23] PVH xen: add hypercall support for PVH
On Thu, 8 Aug 2013 10:20:39 +0100 George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 08/08/13 03:12, Mukesh Rathor wrote:
> > On Wed, 7 Aug 2013 17:43:54 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
....
> >>> +/* PVH 32bitfixme. */
> >>> +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
> >>> +    HYPERCALL(platform_op),
> >>> +    HYPERCALL(memory_op),
> >>> +    HYPERCALL(xen_version),
> >>> +    HYPERCALL(console_io),
> >>> +    [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
> >>> +    [ __HYPERVISOR_vcpu_op ]        = (hvm_hypercall_t *)hvm_vcpu_op,
> >>> +    HYPERCALL(mmuext_op),
> >>> +    HYPERCALL(xsm_op),
> >>> +    HYPERCALL(sched_op),
> >>> +    HYPERCALL(event_channel_op),
> >>> +    [ __HYPERVISOR_physdev_op ]     = (hvm_hypercall_t *)hvm_physdev_op,
> >>> +    HYPERCALL(hvm_op),
> >>> +    HYPERCALL(sysctl),
> >>> +    HYPERCALL(domctl)
> >>> +};
> >> It would be nice if this list were in the same order as the other
> >> lists, so that it is easy to figure out what calls are common and
> >> what calls are different.
> > These are ordered by the hcall number, and assists in the debug.
>
> That makes sense. What about adding a "prep" patch which
> re-organizes the other lists by hcall number? I'm not particular
> about which order, I just think they should be the same.

Jan is going to redo this anyways, so I'm gonna leave as is.

thanks,
mukesh
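
For readers following the table-ordering point, here is a minimal standalone sketch of a dispatch table built with C designated initializers. It is not the Xen table: the toy_* names, the hypercall numbers and the return values are made up for illustration. It only shows that with designated initializers the slot is chosen by the index in brackets, so the order of entries in the source is purely a readability choice, and that unlisted slots stay NULL.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type and hypercall numbers, for illustration only. */
typedef long (*toy_hcall_t)(long arg);

enum {
    TOY_HC_version = 1,
    TOY_HC_console = 3,
    TOY_HC_sched   = 5,
    TOY_NR_HCALLS  = 8
};

static long toy_version(long arg) { (void)arg; return 10; }
static long toy_console(long arg) { return arg + 1; }
static long toy_sched(long arg)   { return arg * 2; }

/*
 * Entries are written out of numeric order on purpose: the bracketed
 * index, not the position in the list, decides which slot is filled.
 * Slots that are not named are zero-initialised, i.e. NULL.
 */
static toy_hcall_t const toy_table[TOY_NR_HCALLS] = {
    [TOY_HC_sched]   = toy_sched,
    [TOY_HC_version] = toy_version,
    [TOY_HC_console] = toy_console,
};

/* Dispatch by number; unknown or unimplemented calls yield -ENOSYS. */
static long toy_dispatch(unsigned int nr, long arg)
{
    if ( nr >= TOY_NR_HCALLS || toy_table[nr] == NULL )
        return -ENOSYS;
    return toy_table[nr](arg);
}
```

Keeping the source order equal to the numeric order, as the patch does, has no effect on behaviour in a table like this; it only makes it easier to compare tables side by side.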
On Tue, 20 Aug 2013 15:27:12 +0100 George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 08/19/2013 11:21 PM, Mukesh Rathor wrote:
> > On Mon, 19 Aug 2013 17:00:53 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
> >>> @@ -1294,6 +1534,9 @@ void vmx_do_resume(struct vcpu *v)
> >>>          hvm_asid_flush_vcpu(v);
> >>>      }
> >>>
> >>> +    if ( is_pvh_vcpu(v) )
> >>> +        reset_stack_and_jump(vmx_asm_do_vmentry);
> >>> +
> >>
> >> This skips the debugger stuff, but also skips hvm_do_resume().
> >> hvm_do_resume() has timer and ioreq stuff that's not needed for
> >> PVH, but it also has code to "Inject a pending hw/sw trap". Might
> >> that code not be needed?
> >
> > We inject exceptions directly for PVH, like in vmxit_int3() if the
> > exception doesn't belong to supported debugger, we inject it into
> > the guest. PVH supports gdbsx debugger, but not the external
> > debugger which seems to set the vector to be injected (for
> > HVMOP_inject_trap).
>
> But the HVM VMX code does exactly the same thing for int3. When is
> this code triggered in the HVM case, and why is it not necessary for
> the PVH case?

As implied above, it is triggered for HVMOP_inject_trap, which I believe is used by some external debugger that sets the vector to be injected.

-M
George Dunlap
2013-Aug-21 08:37 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On 20/08/13 22:32, Mukesh Rathor wrote:
> On Tue, 20 Aug 2013 15:13:10 +0100
> George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
>> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
>> <mukesh.rathor@oracle.com> wrote:
>>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>>> index bff05d9..19a085c 100644
>>> --- a/xen/arch/x86/mm/hap/hap.c
>>> +++ b/xen/arch/x86/mm/hap/hap.c
>>> @@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
>>>  const struct paging_mode *
>>>  hap_paging_get_mode(struct vcpu *v)
>>>  {
>>> -    return !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
>>> +    /* PVH 32bitfixme. */
>>> +    return is_pvh_vcpu(v)        ? &hap_paging_long_mode :
>>> +        !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
>>>          hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
>>>          hvm_pae_enabled(v)       ? &hap_paging_pae_mode  :
>>>                                     &hap_paging_protected_mode;
>> This shouldn't be necessary, right? The PVH code should ensure that
>> for 64-bit PVH guests, hvm_long_mode_enabled() is always true, right?
> Right, 64bit PVH always will be in long mode. However, with 32bit PVH,
> this check will change, so best to leave it here.

How will it change? In that case, won't hvm_long_mode_enabled() return false, but hvm_pae_enabled() return true, and you'll get hap_paging_pae_mode (which I assume is what you would want)?

In any case, if it's not needed now, it shouldn't be introduced now. I've taken it out of my copy.

-George
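
The argument above can be sketched in standalone C. This is only an illustration under stated assumptions, not the Xen code: toy_vcpu_state and toy_get_mode() are hypothetical stand-ins for the vcpu state and for the ternary chain in hap_paging_get_mode() without the extra is_pvh_vcpu() arm. It shows that a vcpu with paging and long mode enabled already selects long mode, and that one with paging and PAE but not long mode falls through to PAE mode.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the paging modes named in the quoted hunk. */
enum toy_paging_mode { MODE_REAL, MODE_PROTECTED, MODE_PAE, MODE_LONG };

/* Hypothetical per-vcpu state mirroring the three predicates in the chain. */
struct toy_vcpu_state {
    bool paging_enabled;
    bool long_mode_enabled;
    bool pae_enabled;
};

/* Same shape as the original chain, with no PVH special case. */
static enum toy_paging_mode toy_get_mode(const struct toy_vcpu_state *s)
{
    return !s->paging_enabled    ? MODE_REAL :
            s->long_mode_enabled ? MODE_LONG :
            s->pae_enabled       ? MODE_PAE  :
                                   MODE_PROTECTED;
}
```

Whether a 64-bit PVH vcpu always has the long-mode predicate set is a property of the PVH setup code, which this sketch simply assumes.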
Mukesh Rathor
2013-Aug-22 01:44 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Mon, 12 Aug 2013 17:00:36 +0100 George Dunlap <dunlapg@umich.edu> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
....
> > Changes in V8:
> >   - Mainly, don't read selectors on vmexit. The macros now come to
> >     VMCS to read selectors on demand.
>
> Overall I have the same comment here as I had for the VMCS patch: the
> code looks 98% identical. Substantial differences seem to be:
>  - emulation of privileged ops
>  - cpuid
>  - cr4 handling
>
> It seems like it would be much better to share the codepath and just
> put "is_pvh_domain()" in the places where it needs to be different.

Depends, and again could be argued either way. The intercepts are so few for PVH that having a lightweight external handler makes it much easier to follow and debug. Also, PVH doesn't carry a lot of the baggage of HVM, given we require HAP. Other maintainers I asked had also suggested making it a separate function.

> >                   ((uint64_t)regs->edx << 32);
> > +
> > +    dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx,
> > +          regs->eax, regs->edx);
> > +
> > +    if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY )
> > +    {
> > +        vmx_update_guest_eip();
> > +        return 0;
> > +    }
> > +    return 1;
> > +}
> > +
> > +static int vmxit_debug(struct cpu_user_regs *regs)
> > +{
> > +    struct vcpu *vp = current;
> > +    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> > +
> > +    write_debugreg(6, exit_qualification | 0xffff0ff0);
> > +
> > +    /* gdbsx or another debugger. Never pause dom0. */
> > +    if ( vp->domain->domain_id != 0 && vp->domain->debugger_attached )
> > +        domain_pause_for_debugger();
> > +    else
> > +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
>
> Hmm, strangely enough, the HVM handler for this doesn't seem to
> deliver this exception -- or if it does, I can't quite figure out
> where. What you have here seems like the correct thing to do, but I
> would be interested in knowing the reason for the HVM behavior.

HVM doesn't intercept this trap unless MTF is not available. We just keep things simple for PVH. In case of MTF, we just won't get here.

..
> > +/* Just like HVM, PVH should be using "cpuid" from the kernel mode. */
> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > +{
> > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> > +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> > +
> > +    return 0;
> > +}
> > +
> > +/* Returns: rc == 0: handled the exception. */
> > +static int vmxit_exception(struct cpu_user_regs *regs)
> > +{
> > +    int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK;
> > +    int rc = -ENOSYS;
>
> The vmx code here has some handler for faults that happen during a
> guest IRET -- is that an issue for PVH?

Hmmm... possibly! But reading the SDMs on this is making my head spin. Let's not hold the series while I investigate this.

> > +            return -EPERM;
> > +        }
> > +        /* TS going from 1 to 0 */
> > +        if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) )
> > +            vmx_fpu_enter(vp);
> > +
> > +        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0;
> > +        __vmwrite(GUEST_CR0, new_cr0);
> > +        __vmwrite(CR0_READ_SHADOW, new_cr0);
> > +    }
> > +    else
> > +        *regp = __vmread(GUEST_CR0);
>
> The HVM code here just uses hvm_vcpu.guest_cr[] -- is there any reason
> not to do the same here? And in any case, shouldn't it be
> CR0_READ_SHADOW?

They are all the same for PVH.

> > +        if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) )
> > +        {
> > +            printk(XENLOG_G_WARNING "Guest cleared CR4.PAE while "
> > +                   "EFER.LMA is set");
> > +            hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > +            return 0;
> > +        }
> > +
> > +        vp->arch.hvm_vcpu.guest_cr[4] = new;
> > +
> > +        if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> > +            vpid_sync_all();
>
> Is it actually allowed for a PVH guest to change modes like this?

The 64bit guest should only change the PGE.

> I realize that at the moment you're only supporting HAP, but that may
> not always be true; would it make sense to call
> paging_update_paging_modes() here instead?

Let's do it in steps. When we support other modes, we can always update this. Right now, we don't really keep track of guest CR3 because we require HAP. Wanting PVH without HAP in future seems extremely low probability to me at this time. We have a lot more work to do for PVH, like migration etc., and keeping things simple will only help us IMHO.

> This seems to be a weird way to do things, but I see this is what they
> do in vmx_vmexit_handler() as well, so I guess it makes sense to
> follow suit.
>
> What about EXIT_REASON_TRIPLE_FAULT?

Would result in domain crash (just like HVM).

thanks
mukesh
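
Two small pieces of arithmetic from the quoted hunks can be checked in isolation. The sketch below is standalone C, not the Xen handlers: the toy_* names are hypothetical, and TOY_CR0_TS assumes the architectural position of CR0.TS (bit 3). The first helper combines EAX (low half) and EDX (high half) into the 64-bit MSR value, which is how the trimmed msr_content assignment appears to be built; the second is the "TS going from 1 to 0" test.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR0.TS is bit 3 of CR0; the TOY_ prefix marks this as a local stand-in. */
#define TOY_CR0_TS (1ul << 3)

/* WRMSR takes the value in EDX:EAX; combine the halves into 64 bits. */
static uint64_t toy_msr_value(uint32_t eax, uint32_t edx)
{
    return (uint64_t)eax | ((uint64_t)edx << 32);
}

/* True only when TS was set in the old CR0 and is clear in the new one. */
static bool toy_ts_cleared(unsigned long old_cr0, unsigned long new_cr0)
{
    return (old_cr0 & TOY_CR0_TS) && !(new_cr0 & TOY_CR0_TS);
}
```

Taking the register halves as uint32_t makes the truncation explicit; if they arrive as wider register values, the low half needs a cast to 32 bits before combining.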
Mukesh Rathor
2013-Aug-22 01:46 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Mon, 12 Aug 2013 17:21:41 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > +static int pvh_ept_handle_violation(unsigned long qualification,
> > +                                    paddr_t gpa, struct cpu_user_regs *regs)
> > +{
> > +    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
> > +    p2m_type_t p2mt;
> > +    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
> > +
> > +    printk(XENLOG_G_ERR "EPT violation %#lx (%c%c%c/%c%c%c), "
> > +           "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n",
> > +           qualification,
> > +           (qualification & EPT_READ_VIOLATION) ? 'r' : '-',
> > +           (qualification & EPT_WRITE_VIOLATION) ? 'w' : '-',
> > +           (qualification & EPT_EXEC_VIOLATION) ? 'x' : '-',
> > +           (qualification & EPT_EFFECTIVE_READ) ? 'r' : '-',
> > +           (qualification & EPT_EFFECTIVE_WRITE) ? 'w' : '-',
> > +           (qualification & EPT_EFFECTIVE_EXEC) ? 'x' : '-',
> > +           gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp);
> > +
> > +    ept_walk_table(current->domain, gfn);
> > +
> > +    if ( qualification & EPT_GLA_VALID )
> > +    {
> > +        gla = __vmread(GUEST_LINEAR_ADDRESS);
> > +        printk(XENLOG_G_ERR " --- GLA %#lx\n", gla);
> > +    }
> > +    hvm_inject_hw_exception(TRAP_gp_fault, 0);
> > +    return 0;
> > +}
>
> Similar to the TRAP_debug issue -- the HVM code here crashes the
> guest; as there is unlikely to be anything the guest can do to fix
> things up at this point, that is almost certainly the right thing to
> do.

The advantage of GP injection is the guest gets a chance to print debug info. Often linux will print the stacks of all vcpus before crashing itself on GP.

Mukesh
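
For anyone reading such a log line, the "(rwx/rwx)" field can be decoded as in the standalone sketch below. It is an illustration, not the Xen code: the TOY_EPT_* masks are local stand-ins that assume the exit-qualification layout for EPT violations (bits 0-2 for the attempted access, bits 3-5 for the permissions the EPT entry granted), and toy_ept_perm_string() is a hypothetical helper.

```c
#include <stdio.h>

/* Assumed bit layout of the EPT-violation exit qualification. */
#define TOY_EPT_READ_VIOLATION   (1ul << 0)  /* access was a data read    */
#define TOY_EPT_WRITE_VIOLATION  (1ul << 1)  /* access was a data write   */
#define TOY_EPT_EXEC_VIOLATION   (1ul << 2)  /* access was an insn fetch  */
#define TOY_EPT_EFFECTIVE_READ   (1ul << 3)  /* entry allowed reads       */
#define TOY_EPT_EFFECTIVE_WRITE  (1ul << 4)  /* entry allowed writes      */
#define TOY_EPT_EFFECTIVE_EXEC   (1ul << 5)  /* entry allowed execution   */

/*
 * Render "attempted/allowed" in the same rwx/rwx form as the quoted
 * printk. buf must hold at least 8 bytes (7 characters plus the NUL).
 */
static void toy_ept_perm_string(unsigned long q, char buf[8])
{
    snprintf(buf, 8, "%c%c%c/%c%c%c",
             (q & TOY_EPT_READ_VIOLATION)  ? 'r' : '-',
             (q & TOY_EPT_WRITE_VIOLATION) ? 'w' : '-',
             (q & TOY_EPT_EXEC_VIOLATION)  ? 'x' : '-',
             (q & TOY_EPT_EFFECTIVE_READ)  ? 'r' : '-',
             (q & TOY_EPT_EFFECTIVE_WRITE) ? 'w' : '-',
             (q & TOY_EPT_EFFECTIVE_EXEC)  ? 'x' : '-');
}
```

With this layout, a write to a page mapped read-only has bits 1 and 3 set and renders as "-w-/r--".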
Mukesh Rathor
2013-Aug-22 23:22 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Mon, 12 Aug 2013 17:00:36 +0100 George Dunlap <dunlapg@umich.edu> wrote:
> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> > +/* Returns : 0 == msr read successfully. */
> > +static int vmxit_msr_read(struct cpu_user_regs *regs)
> > +{
> > +    u64 msr_content = 0;
> > +
> > +    switch ( regs->ecx )
> > +    {
> > +    case MSR_IA32_MISC_ENABLE:
> > +        rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> > +        msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> > +                       MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> > +        break;
> > +
> > +    default:
> > +        /* PVH fixme: see hvm_msr_read_intercept(). */
> > +        rdmsrl(regs->ecx, msr_content);
> > +        break;
>
> So at the moment you basically pass through all MSR reads (adding
> BTS_UNAVAIL and PEBS_UNAVAIL to MISC_ENABLE), but send MSR writes
> through the hvm code?
>
> That sounds like it's asking for trouble...

Hence the fixme there. I intended to come back to this during the AMD port because of the differences between vmx_ and svm_ msr reads. In general, we should have less intercepts for PVH, eg, there should be no MSR_IA32_CR_PAT intercept. If there is nothing specific for VMX and SVM for PVH, then perhaps a generic solution with maybe a union or new data struct for PVH.... Anyways, let's divide and conquer by coming back to this.

...
> > +
> > +        __vmwrite(CR4_READ_SHADOW, new);
> > +
> > +        new &= ~X86_CR4_PAE;         /* PVH always runs with hap enabled. */
> > +        new |= X86_CR4_VMXE | X86_CR4_MCE;
> > +        __vmwrite(GUEST_CR4, new);
>
> Should you be updating hvm_vcpu.hw_cr[4] to this value?

We don't use hw_cr[4] for PVH anywhere. I added a comment.

> > +    }
> > +    else
> > +        *regp = __vmread(CR4_READ_SHADOW);
>
> Same as above re guest_cr[]
>

We do set it a few lines above:

    vp->arch.hvm_vcpu.guest_cr[4] = new;

-Mukesh
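
The split between the value the guest reads back and the value the hardware runs with, as in the quoted CR4 hunk, can be sketched in standalone C. This mirrors only the three quoted statements and is not the Xen handler: toy_cr4_pair and toy_cr4_write() are hypothetical, and the TOY_CR4_* masks assume the architectural bit positions (PAE bit 5, MCE bit 6, VMXE bit 13).

```c
/* Assumed architectural CR4 bit positions; TOY_ marks local stand-ins. */
#define TOY_CR4_PAE   (1ul << 5)
#define TOY_CR4_MCE   (1ul << 6)
#define TOY_CR4_VMXE  (1ul << 13)

/* The two values derived from a single guest CR4 write. */
struct toy_cr4_pair {
    unsigned long read_shadow;  /* what the guest sees when it reads CR4 */
    unsigned long hw;           /* what would be written to GUEST_CR4    */
};

static struct toy_cr4_pair toy_cr4_write(unsigned long guest_val)
{
    struct toy_cr4_pair p;

    /* The read shadow keeps exactly the value the guest wrote. */
    p.read_shadow = guest_val;

    /* Mirror the quoted lines: drop PAE, force VMXE and MCE on. */
    p.hw = guest_val;
    p.hw &= ~TOY_CR4_PAE;
    p.hw |= TOY_CR4_VMXE | TOY_CR4_MCE;

    return p;
}
```

For a guest write of PAE plus PGE (0xa0), this yields a read shadow of 0xa0 and a hardware value of 0x20c0, which is why reading CR4_READ_SHADOW rather than GUEST_CR4 returns what the guest expects.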
Mukesh Rathor
2013-Aug-22 23:24 UTC
Re: [V10 PATCH 10/23] PVH xen: domain create, context switch related code changes
On Tue, 20 Aug 2013 10:29:24 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Aug 20, 2013 at 1:52 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > On Fri, 16 Aug 2013 17:11:21 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 16.08.13 at 17:32, George Dunlap
> >> >>> <George.Dunlap@eu.citrix.com> wrote:
> >> > On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> > ......
> >> >>
> >> >>      set_bit(_VPF_in_reset, &v->pause_flags);
> >> >>
> >> >> + pvh_skip_pv_stuff:
> >> >
> >> > Any idea what this set_bit(_VPF_in_reset) stuff is? It looks
> >> > like it's set above, and cleared down near the bottom of the
> >> > function if nothing gets screwed up.
> >>
> >> This is related to the preemptible vCPU reset (which
> >> arch_set_info_guest() just re-uses), making sure that while there
> >> is an incomplete state update for a vCPU (because it may have got
> >> preempted) the vCPU can't be unpaused.
> >>
> >> > It seems like if that set/clear pair is important, then PVH
> >> > should do them both as well, shouldn't it?
> >>
> >> I thought I had checked this once - does it now bypass one of the
> >> two?
> >>
> >> But then again, this is all about PV memory management, so perhaps
> >> it was that way when I checked, and I decided it was fine.
> >
> > Ok, I'll just leave it as it is then. Setting it might confuse
> > someone why it's being set for PVH. Clearing is harmless anyways :).
>
> I think much more confusing is skipping the set. Part of the reason I
> was asking is that I was looking at re-organizing the function so that
> all the stuff common to PVH and PV were at the top; Then instead of
> the goto, you would just have two return's, one for HVM, one for PVH,
> at the appropriate place; I backed off and asked when I saw this.

Ok, I moved the goto target above the set. So it gets set for PVH also.

thanks,
mukesh
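
The outcome of this subthread can be sketched in standalone C. This is a minimal illustration under stated assumptions, not the Xen function: toy_vcpu, toy_set_info_guest(), set_flag()/clear_flag() and the bit index are hypothetical stand-ins, and the HVM "goto out" is modelled as an early return. The point is only the placement of the label: with the goto target above the set, the PVH path skips the PV-only work but still raises and lowers the guard as a pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit index; not the value Xen uses for _VPF_in_reset. */
#define TOY_VPF_IN_RESET_BIT 7u

/* Hypothetical stand-ins for set_bit()/clear_bit() on pause_flags. */
static void set_flag(uint32_t *flags, unsigned int bit)   { *flags |=  (1u << bit); }
static void clear_flag(uint32_t *flags, unsigned int bit) { *flags &= ~(1u << bit); }

enum guest_kind { GUEST_PV, GUEST_PVH, GUEST_HVM };

struct toy_vcpu {
    uint32_t pause_flags;
    bool pv_state_loaded;   /* stands in for the PV-only setup          */
    int in_reset_sets;      /* counts how often the guard was raised    */
};

static int toy_set_info_guest(struct toy_vcpu *v, enum guest_kind kind)
{
    if ( kind == GUEST_HVM )
        return 0;                    /* HVM leaves before any of this   */

    if ( kind == GUEST_PVH )
        goto pvh_skip_pv_stuff;      /* skip PV-only work, not the guard */

    v->pv_state_loaded = true;       /* PV-only work                    */

 pvh_skip_pv_stuff:
    /* Label sits above the set, so PV and PVH both raise the guard. */
    set_flag(&v->pause_flags, TOY_VPF_IN_RESET_BIT);
    v->in_reset_sets++;

    /* ... the preemptible state update would happen here ... */

    clear_flag(&v->pause_flags, TOY_VPF_IN_RESET_BIT);
    return 0;
}
```

Had the label been placed below the set, the PVH path would reach only the clear, which is the asymmetry George was asking about.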
Mukesh Rathor
2013-Aug-22 23:27 UTC
Re: [V10 PATCH 14/23] PVH xen: additional changes to support PVH guest creation and execution.
On Wed, 21 Aug 2013 09:37:35 +0100 George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 20/08/13 22:32, Mukesh Rathor wrote:
> > On Tue, 20 Aug 2013 15:13:10 +0100
> > George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >
> >> On Wed, Jul 24, 2013 at 2:59 AM, Mukesh Rathor
> >> <mukesh.rathor@oracle.com> wrote:
> >>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> >>> index bff05d9..19a085c 100644
> >>> --- a/xen/arch/x86/mm/hap/hap.c
> >>> +++ b/xen/arch/x86/mm/hap/hap.c
> >>> @@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int
> >>> do_locking) const struct paging_mode *
> >>>  hap_paging_get_mode(struct vcpu *v)
> >>>  {
> >>> -    return !hvm_paging_enabled(v) ? &hap_paging_real_mode :
> >>> +    /* PVH 32bitfixme. */
> >>> +    return is_pvh_vcpu(v) ? &hap_paging_long_mode :
> >>> +        !hvm_paging_enabled(v) ? &hap_paging_real_mode :
> >>>          hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
> >>>          hvm_pae_enabled(v) ? &hap_paging_pae_mode :
> >>>          &hap_paging_protected_mode;
> >> This shouldn't be necessary, right? The PVH code should ensure
> >> that for 64-bit PVH guests, hvm_long_mode_enabled() is always
> >> true, right?
> > Right, 64bit PVH always will be in long mode. However, with 32bit
> > PVH, this check will change, so best to leave it here.
>
> How will it change? In that case, won't hvm_long_mode() return
> false, but hvm_pae_enabled() return true, and you'll get
> hap_paging_pae_mode (which I assume is what you would want)?
>
> In any case, if it's not needed now, it shouldn't be introduced now.
> I've taken it out of my copy.

Ok, I removed it too from V11 coming up soon.

thanks
Mukesh
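The argument above about the extra is_pvh_vcpu() test can be checked with a small stand-alone sketch. Everything here is a hypothetical model: `struct toy_vcpu_state`, the `TOY_*` enum values and both functions only mimic the shape of the conditional chain in the quoted diff and are not Xen's hap_paging_get_mode() or its predicates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the vcpu state the quoted chain inspects. */
enum toy_paging_mode { TOY_REAL, TOY_PROTECTED, TOY_PAE, TOY_LONG };

struct toy_vcpu_state {
    bool is_pvh;
    bool paging_enabled;
    bool long_mode_enabled;
    bool pae_enabled;
};

/* Shape of the chain without the PVH special case. */
enum toy_paging_mode mode_without_pvh_check(const struct toy_vcpu_state *s)
{
    return !s->paging_enabled   ? TOY_REAL :
           s->long_mode_enabled ? TOY_LONG :
           s->pae_enabled       ? TOY_PAE  :
                                  TOY_PROTECTED;
}

/* Shape of the chain with the PVH check placed in front, as in the V10 diff. */
enum toy_paging_mode mode_with_pvh_check(const struct toy_vcpu_state *s)
{
    return s->is_pvh ? TOY_LONG : mode_without_pvh_check(s);
}
```

For a state that models a 64-bit PVH vcpu (paging and long mode both enabled) the two functions agree, so the extra check adds nothing. For a state modelling a hypothetical 32-bit PAE PVH vcpu they differ: the unmodified chain yields `TOY_PAE`, while the front-loaded check forces `TOY_LONG`, which is the point George raises.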
Jan Beulich
2013-Aug-23 07:16 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
>>> On 23.08.13 at 01:22, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> In general, we should have less intercepts for PVH, eg, there should
> be no MSR_IA32_CR_PAT intercept.

How that? All memory management supposedly is HVM-like, and PAT is an integral part of memory management. Unless I'm mistaken, not intercepting PAT writes would mean you allow the guest access to the physical MSR, which surely is wrong.

Jan
Mukesh Rathor
2013-Aug-23 22:51 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
On Fri, 23 Aug 2013 08:16:40 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 23.08.13 at 01:22, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > In general, we should have less intercepts for PVH, eg, there
> > should be no MSR_IA32_CR_PAT intercept.
>
> How that? All memory management supposedly is HVM-like, and
> PAT is an integral part of memory management. Unless I'm
> mistaken, not intercepting PAT writes would mean you allow the
> guest access to the physical MSR, which surely is wrong.

Here's HVM code:

    if ( cpu_has_vmx_pat && paging_mode_hap(d) )
        vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
                                      MSR_TYPE_R | MSR_TYPE_W);

We require both for PVH, see pvh_check_requirements. My understanding, guest would write to GUEST_PAT and not HOST_PAT.

thanks
Mukesh
Jan Beulich
2013-Aug-26 08:09 UTC
Re: [V10 PATCH 23/23] PVH xen: introduce vmexit handler for PVH
>>> On 24.08.13 at 00:51, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 23 Aug 2013 08:16:40 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 23.08.13 at 01:22, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > In general, we should have less intercepts for PVH, eg, there
>> > should be no MSR_IA32_CR_PAT intercept.
>>
>> How that? All memory management supposedly is HVM-like, and
>> PAT is an integral part of memory management. Unless I'm
>> mistaken, not intercepting PAT writes would mean you allow the
>> guest access to the physical MSR, which surely is wrong.
>
> Here's HVM code:
>
>     if ( cpu_has_vmx_pat && paging_mode_hap(d) )
>         vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
>                                       MSR_TYPE_R | MSR_TYPE_W);
>
> We require both for PVH, see pvh_check_requirements. My understanding,
> guest would write to GUEST_PAT and not HOST_PAT.

Ah, right, sorry - I forgot that VMX already has built in separation of them.

Jan
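The "built in separation" the thread converges on can be pictured with a toy model. This is only a sketch under stated assumptions: `struct toy_vmcs`, `struct toy_cpu` and the `toy_*` functions are hypothetical, and the model simply assumes that the hardware loads a guest PAT value on VM entry and saves it and restores the host value on VM exit. It is not Xen code and does not call any real VMX interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vcpu control structure with separate guest/host PAT slots. */
struct toy_vmcs {
    uint64_t guest_pat;
    uint64_t host_pat;
};

/* Hypothetical physical CPU with a single PAT register. */
struct toy_cpu {
    uint64_t pat_msr;
};

/* Mirrors the shape of the quoted condition: both features must be present. */
bool toy_can_skip_pat_intercept(bool cpu_has_vmx_pat, bool hap_enabled)
{
    return cpu_has_vmx_pat && hap_enabled;
}

/* Assumed behaviour: entry loads the guest's PAT into the register. */
void toy_vm_entry(struct toy_cpu *cpu, const struct toy_vmcs *vmcs)
{
    cpu->pat_msr = vmcs->guest_pat;
}

/* Un-intercepted guest write: lands in the register while the guest runs. */
void toy_guest_wrmsr_pat(struct toy_cpu *cpu, uint64_t value)
{
    cpu->pat_msr = value;
}

/* Assumed behaviour: exit saves the guest's PAT, then restores the host's. */
void toy_vm_exit(struct toy_cpu *cpu, struct toy_vmcs *vmcs)
{
    vmcs->guest_pat = cpu->pat_msr;
    cpu->pat_msr = vmcs->host_pat;
}
```

Under these assumptions a guest write between `toy_vm_entry()` and `toy_vm_exit()` ends up in `guest_pat` and the register again holds `host_pat` once the host is running, so dropping the intercept does not expose the host's value to guest changes.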