Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 00/18][V7]: PVH xen: Phase I, Version 7 patches...
Hi all,

This is V7 of the PVH patches for xen. These are xen changes to support boot of a 64bit PVH domU guest. Built on top of unstable git c/s: a12d15d8c1d512a4ed6498b39f9058f69a1c1f6c

New in V7:
  - Dropped all dom0 patches from V6.
  - Dropped tool changes. They will need to be broken down into multiple
    patches as asked by Ian C, and I'll submit them separately.
  - Reorganized the patches to make them smaller and centered around a
    logical change, instead of around changes in a file.

Coming in the future after this is done, two patchsets: 1) tools changes and
2) dom0 changes.

Phase I:
  - Establish a baseline of something working. Note, HAP is required for PVH.

Repeating from before: as a result of V3, there are two new action items on
the linux side before it will boot as PVH: 1) MSI-X fixup and 2) load
KERNEL_CS right after the gdt switch.

As a result of V5, a new fixme:
  - MMIO ranges above the highest covered e820 address must be mapped for dom0.

The following fixmes exist in the code:
  - Add support for more memory types in arch/x86/hvm/mtrr.c.
  - arch/x86/time.c: support more tsc modes.
  - check_guest_io_breakpoint(): check/add support for IO breakpoints.
  - Implement arch_get_info_guest() for pvh.
  - vmxit_msr_read(): during the AMD port, go thru hvm_msr_read_intercept() again.
  - Verify that breakpoint matching on emulated instructions works the same
    as HVM for a PVH guest. See instruction_done() and
    check_guest_io_breakpoint().

The following remain to be done for PVH:
  - AMD port.
  - Make posted interrupts available to PVH dom0 (this will be a big win).
  - 32bit support in both linux and xen. Xen changes are tagged "32bitfixme".
  - Add support for monitoring guest behavior. See hvm_memory_event*
    functions in hvm.c.
  - Change xl to support modes other than "phy:".
  - Hotplug support.
  - Migration of PVH guests.

Thanks for all the help,
Mukesh
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 01/18] PVH xen: turn gdb_frames/gdt_ents into union
Changes in V2:
  - Add __XEN_INTERFACE_VERSION__

Changes in V3:
  - Rename union to 'gdt' and rename field names.

Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 tools/libxc/xc_domain_restore.c   |  8 ++++----
 tools/libxc/xc_domain_save.c      |  6 +++---
 xen/arch/x86/domain.c             | 12 ++++++------
 xen/arch/x86/domctl.c             | 12 ++++++------
 xen/include/public/arch-x86/xen.h | 14 ++++++++++++++
 5 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index f53ff88..6e8c42f 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -2055,15 +2055,15 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
             munmap(start_info, PAGE_SIZE);
         }

         /* Uncanonicalise each GDT frame number. */
-        if ( GET_FIELD(ctxt, gdt_ents) > 8192 )
+        if ( GET_FIELD(ctxt, gdt.pv.num_ents) > 8192 )
         {
             ERROR("GDT entry count out of range");
             goto out;
         }

-        for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt_ents); j++ )
+        for ( j = 0; (512*j) < GET_FIELD(ctxt, gdt.pv.num_ents); j++ )
         {
-            pfn = GET_FIELD(ctxt, gdt_frames[j]);
+            pfn = GET_FIELD(ctxt, gdt.pv.frames[j]);
             if ( (pfn >= dinfo->p2m_size) ||
                  (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) )
             {
@@ -2071,7 +2071,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       j, (unsigned long)pfn);
                 goto out;
             }
-            SET_FIELD(ctxt, gdt_frames[j], ctx->p2m[pfn]);
+            SET_FIELD(ctxt, gdt.pv.frames[j], ctx->p2m[pfn]);
         }

         /* Uncanonicalise the page table base pointer. */
         pfn = UNFOLD_CR3(GET_FIELD(ctxt, ctrlreg[3]));
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index ff76626..97cf64a 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1900,15 +1900,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
         }

         /* Canonicalise each GDT frame number. */
-        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ )
+        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt.pv.num_ents); j++ )
         {
-            mfn = GET_FIELD(&ctxt, gdt_frames[j]);
+            mfn = GET_FIELD(&ctxt, gdt.pv.frames[j]);
             if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn) )
             {
                 ERROR("GDT frame is not in range of pseudophys map");
                 goto out;
             }
-            SET_FIELD(&ctxt, gdt_frames[j], mfn_to_pfn(mfn));
+            SET_FIELD(&ctxt, gdt.pv.frames[j], mfn_to_pfn(mfn));
         }

         /* Canonicalise the page table base pointer. */
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 48f3487..bc12d04 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -780,8 +780,8 @@ int arch_set_info_guest(
         }

         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
-            fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt_frames[i]);
-        fail |= v->arch.pv_vcpu.gdt_ents != c(gdt_ents);
+            fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt.pv.frames[i]);
+        fail |= v->arch.pv_vcpu.gdt_ents != c(gdt.pv.num_ents);

         fail |= v->arch.pv_vcpu.ldt_base != c(ldt_base);
         fail |= v->arch.pv_vcpu.ldt_ents != c(ldt_ents);
@@ -830,17 +830,17 @@ int arch_set_info_guest(
         d->vm_assist = c(vm_assist);

     if ( !compat )
-        rc = (int)set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents);
+        rc = (int)set_gdt(v, c.nat->gdt.pv.frames, c.nat->gdt.pv.num_ents);
     else
     {
         unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames)];
-        unsigned int n = (c.cmp->gdt_ents + 511) / 512;
+        unsigned int n = (c.cmp->gdt.pv.num_ents + 511) / 512;

         if ( n > ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames) )
             return -EINVAL;
         for ( i = 0; i < n; ++i )
-            gdt_frames[i] = c.cmp->gdt_frames[i];
-        rc = (int)set_gdt(v, gdt_frames, c.cmp->gdt_ents);
+            gdt_frames[i] = c.cmp->gdt.pv.frames[i];
+        rc = (int)set_gdt(v, gdt_frames, c.cmp->gdt.pv.num_ents);
     }
     if ( rc != 0 )
         return rc;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index c2a04c4..f87d6ab 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1300,12 +1300,12 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
         c(ldt_base = v->arch.pv_vcpu.ldt_base);
         c(ldt_ents = v->arch.pv_vcpu.ldt_ents);
         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
-            c(gdt_frames[i] = v->arch.pv_vcpu.gdt_frames[i]);
-        BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt_frames) !=
-                     ARRAY_SIZE(c.cmp->gdt_frames));
-        for ( ; i < ARRAY_SIZE(c.nat->gdt_frames); ++i )
-            c(gdt_frames[i] = 0);
-        c(gdt_ents = v->arch.pv_vcpu.gdt_ents);
+            c(gdt.pv.frames[i] = v->arch.pv_vcpu.gdt_frames[i]);
+        BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt.pv.frames) !=
+                     ARRAY_SIZE(c.cmp->gdt.pv.frames));
+        for ( ; i < ARRAY_SIZE(c.nat->gdt.pv.frames); ++i )
+            c(gdt.pv.frames[i] = 0);
+        c(gdt.pv.num_ents = v->arch.pv_vcpu.gdt_ents);
         c(kernel_ss = v->arch.pv_vcpu.kernel_ss);
         c(kernel_sp = v->arch.pv_vcpu.kernel_sp);
         for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.ctrlreg); ++i )
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index b7f6a51..25c8519 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -170,7 +170,21 @@ struct vcpu_guest_context {
     struct cpu_user_regs user_regs;         /* User-level CPU registers     */
     struct trap_info trap_ctxt[256];        /* Virtual IDT                  */
     unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */
+#if __XEN_INTERFACE_VERSION__ < 0x00040400
     unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */
+#else
+    union {
+        struct {
+            /* GDT (machine frames, # ents) */
+            unsigned long frames[16], num_ents;
+        } pv;
+        struct {
+            /* PVH: GDTR addr and size */
+            uint64_t addr;
+            uint16_t limit;
+        } pvh;
+    } gdt;
+#endif
     unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1)   */
     /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */
     unsigned long ctrlreg[8];               /* CR0-CR7 (control registers)  */
--
1.7.2.3
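[Illustration, not part of the patch: with __XEN_INTERFACE_VERSION__ >= 0x00040400 the same context field carries either view of the GDT. A minimal toolstack-side sketch; fill_gdt() is a hypothetical helper and the frame number and GDTR base below are placeholders.]

    /* Sketch: fill the gdt union for a PV vs a PVH vcpu context. */
    static void fill_gdt(struct vcpu_guest_context *ctxt, int is_pvh)
    {
        if ( !is_pvh )
        {
            /* PV: GDT as machine frame numbers plus an entry count. */
            ctxt->gdt.pv.frames[0] = 0x1234;       /* placeholder frame */
            ctxt->gdt.pv.num_ents  = 512;          /* one page of entries */
        }
        else
        {
            /* PVH: GDT the way LGDT sees it - native base and limit. */
            ctxt->gdt.pvh.addr  = 0xfe000000;      /* placeholder base */
            ctxt->gdt.pvh.limit = 512 * 8 - 1;
        }
    }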
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 02/18] PVH xen: add params to read_segment_register
In this preparatory patch, the read_segment_register macro is changed to take
vcpu and regs parameters. No functionality change.

Changes in V2: None
Changes in V3:
  - Replace read_sreg with read_segment_register

Changes in V7:
  - Don't make emulate_privileged_op() public here.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c        |  8 ++++----
 xen/arch/x86/traps.c         | 26 ++++++++++++--------------
 xen/arch/x86/x86_64/traps.c  | 16 ++++++++--------
 xen/include/asm-x86/system.h |  2 +-
 4 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index bc12d04..d530964 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1202,10 +1202,10 @@ static void save_segments(struct vcpu *v)
     struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned int dirty_segment_mask = 0;

-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(v, regs, ds);
+    regs->es = read_segment_register(v, regs, es);
+    regs->fs = read_segment_register(v, regs, fs);
+    regs->gs = read_segment_register(v, regs, gs);

     if ( regs->ds )
         dirty_segment_mask |= DIRTY_DS;
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 57dbd0c..378ef0a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1831,8 +1831,6 @@ static inline uint64_t guest_misc_enable(uint64_t val)
     }                                                                       \
     (eip) += sizeof(_x); _x; })

-#define read_sreg(regs, sr) read_segment_register(sr)
-
 static int is_cpufreq_controller(struct domain *d)
 {
     return ((cpufreq_controller == FREQCTL_dom0_kernel) &&
@@ -1877,7 +1875,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         goto fail;

     /* emulating only opcodes not allowing SS to be default */
-    data_sel = read_sreg(regs, ds);
+    data_sel = read_segment_register(v, regs, ds);

     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1895,17 +1893,17 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
             data_sel = regs->cs;
             continue;
         case 0x3e: /* DS override */
-            data_sel = read_sreg(regs, ds);
+            data_sel = read_segment_register(v, regs, ds);
             continue;
         case 0x26: /* ES override */
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             continue;
         case 0x64: /* FS override */
-            data_sel = read_sreg(regs, fs);
+            data_sel = read_segment_register(v, regs, fs);
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
-            data_sel = read_sreg(regs, gs);
+            data_sel = read_segment_register(v, regs, gs);
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
@@ -1952,7 +1950,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)

         if ( !(opcode & 2) )
         {
-            data_sel = read_sreg(regs, es);
+            data_sel = read_segment_register(v, regs, es);
             lm_ovr = lm_seg_none;
         }
@@ -2685,22 +2683,22 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                     ASSERT(opnd_sel);
                     continue;
                 case 0x3e: /* DS override */
-                    opnd_sel = read_sreg(regs, ds);
+                    opnd_sel = read_segment_register(v, regs, ds);
                     if ( !opnd_sel )
                         opnd_sel = dpl;
                     continue;
                 case 0x26: /* ES override */
-                    opnd_sel = read_sreg(regs, es);
+                    opnd_sel = read_segment_register(v, regs, es);
                     if ( !opnd_sel )
                         opnd_sel = dpl;
                     continue;
                 case 0x64: /* FS override */
-                    opnd_sel = read_sreg(regs, fs);
+                    opnd_sel = read_segment_register(v, regs, fs);
                     if ( !opnd_sel )
                         opnd_sel = dpl;
                     continue;
                 case 0x65: /* GS override */
-                    opnd_sel = read_sreg(regs, gs);
+                    opnd_sel = read_segment_register(v, regs, gs);
                     if ( !opnd_sel )
                         opnd_sel = dpl;
                     continue;
@@ -2753,7 +2751,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                     switch ( modrm & 7 )
                     {
                     default:
-                        opnd_sel = read_sreg(regs, ds);
+                        opnd_sel = read_segment_register(v, regs, ds);
                         break;
                     case 4: case 5:
                         opnd_sel = regs->ss;
@@ -2781,7 +2779,7 @@ static void emulate_gate_op(struct cpu_user_regs *regs)
                         break;
                 }
                 if ( !opnd_sel )
-                    opnd_sel = read_sreg(regs, ds);
+                    opnd_sel = read_segment_register(v, regs, ds);
                 switch ( modrm & 7 )
                 {
                 case 0: case 2: case 4:
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index eec919a..d2f7209 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -122,10 +122,10 @@ void show_registers(struct cpu_user_regs *regs)
         fault_crs[0] = read_cr0();
         fault_crs[3] = read_cr3();
         fault_crs[4] = read_cr4();
-        fault_regs.ds = read_segment_register(ds);
-        fault_regs.es = read_segment_register(es);
-        fault_regs.fs = read_segment_register(fs);
-        fault_regs.gs = read_segment_register(gs);
+        fault_regs.ds = read_segment_register(v, regs, ds);
+        fault_regs.es = read_segment_register(v, regs, es);
+        fault_regs.fs = read_segment_register(v, regs, fs);
+        fault_regs.gs = read_segment_register(v, regs, gs);
     }

     print_xen_info();
@@ -240,10 +240,10 @@ void do_double_fault(struct cpu_user_regs *regs)
     crs[2] = read_cr2();
     crs[3] = read_cr3();
     crs[4] = read_cr4();
-    regs->ds = read_segment_register(ds);
-    regs->es = read_segment_register(es);
-    regs->fs = read_segment_register(fs);
-    regs->gs = read_segment_register(gs);
+    regs->ds = read_segment_register(current, regs, ds);
+    regs->es = read_segment_register(current, regs, es);
+    regs->fs = read_segment_register(current, regs, fs);
+    regs->gs = read_segment_register(current, regs, gs);

     printk("CPU:    %d\n", cpu);
     _show_registers(regs, crs, CTXT_hypervisor, NULL);
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 6ab7d56..9bb22cb 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -4,7 +4,7 @@
 #include <xen/lib.h>
 #include <xen/bitops.h>

-#define read_segment_register(name)                             \
+#define read_segment_register(vcpu, regs, name)                 \
 ({  u16 __sel;                                                  \
     asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
     __sel;                                                      \
--
1.7.2.3
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 03/18] PVH xen: Move e820 fields out of pv_domain struct
This patch moves fields out of the pv_domain struct as they are used by PVH
also.

Changes in V6:
  - Don't base initialization and cleanup on the guest type.

Changes in V7:
  - The if statement doesn't need to be split across lines anymore.

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c        | 10 ++++------
 xen/arch/x86/mm.c            | 26 ++++++++++++--------------
 xen/include/asm-x86/domain.h | 10 +++++-----
 3 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d530964..6c85c94 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         if ( (rc = iommu_domain_init(d)) != 0 )
             goto fail;
     }
+    spin_lock_init(&d->arch.e820_lock);

     if ( is_hvm_domain(d) )
     {
@@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         }
     }
     else
-    {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;

-        spin_lock_init(&d->arch.pv_domain.e820_lock);
-    }
-
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
     spin_lock_init(&d->arch.vtsc_lock);
@@ -592,8 +589,9 @@ void arch_domain_destroy(struct domain *d)
 {
     if ( is_hvm_domain(d) )
         hvm_domain_destroy(d);
-    else
-        xfree(d->arch.pv_domain.e820);
+
+    if ( d->arch.e820 )
+        xfree(d->arch.e820);

     free_domain_pirqs(d);
     if ( !is_idle_domain(d) )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5123860..9f58968 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4759,11 +4759,11 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
             return -EFAULT;
         }

-        spin_lock(&d->arch.pv_domain.e820_lock);
-        xfree(d->arch.pv_domain.e820);
-        d->arch.pv_domain.e820 = e820;
-        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);
+        xfree(d->arch.e820);
+        d->arch.e820 = e820;
+        d->arch.nr_e820 = fmap.map.nr_entries;
+        spin_unlock(&d->arch.e820_lock);

         rcu_unlock_domain(d);
         return rc;
@@ -4777,26 +4777,24 @@ long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&map, arg, 1) )
             return -EFAULT;

-        spin_lock(&d->arch.pv_domain.e820_lock);
+        spin_lock(&d->arch.e820_lock);

         /* Backwards compatibility. */
-        if ( (d->arch.pv_domain.nr_e820 == 0) ||
-             (d->arch.pv_domain.e820 == NULL) )
+        if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -ENOSYS;
         }

-        map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820);
-        if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820,
-                           map.nr_entries) ||
+        map.nr_entries = min(map.nr_entries, d->arch.nr_e820);
+        if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) ||
              __copy_to_guest(arg, &map, 1) )
         {
-            spin_unlock(&d->arch.pv_domain.e820_lock);
+            spin_unlock(&d->arch.e820_lock);
             return -EFAULT;
         }

-        spin_unlock(&d->arch.pv_domain.e820_lock);
+        spin_unlock(&d->arch.e820_lock);
         return 0;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index d79464d..c3f9f8e 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -234,11 +234,6 @@ struct pv_domain

     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
-
-    /* Pseudophysical e820 map (XENMEM_memory_map). */
-    spinlock_t e820_lock;
-    struct e820entry *e820;
-    unsigned int nr_e820;
 };

 struct arch_domain
@@ -313,6 +308,11 @@ struct arch_domain
                                 (possibly other cases in the future */
     uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */
     uint64_t vtsc_usercount; /* not used for hvm */
+
+    /* Pseudophysical e820 map (XENMEM_memory_map). */
+    spinlock_t e820_lock;
+    struct e820entry *e820;
+    unsigned int nr_e820;
 } __cacheline_aligned;

 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
--
1.7.2.3
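[Illustration, not part of the patch: the map now living in d->arch.e820 is what a guest reads back with XENMEM_memory_map. A hedged guest-side sketch, error handling simplified; E820MAX, set_xen_guest_handle() and the hypercall wrapper are the usual public-header/guest-OS definitions, and get_pseudophys_e820() is a hypothetical helper.]

    /* Sketch: fetch the pseudophysical e820 map from Xen. */
    static int get_pseudophys_e820(struct e820entry *entries, unsigned int max)
    {
        struct xen_memory_map memmap = { .nr_entries = max };

        set_xen_guest_handle(memmap.buffer, entries);
        if ( HYPERVISOR_memory_op(XENMEM_memory_map, &memmap) )
            return -1;               /* e.g. -ENOSYS if no map was ever set */
        return memmap.nr_entries;    /* number of entries Xen copied out */
    }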
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 04/18] PVH xen: vmx related preparatory changes for PVH
This is another preparatory patch for PVH. In this patch, the following
functions are made available for general/public use: vmx_fpu_enter(),
get_instruction_length(), update_guest_eip(), and vmx_dr_access(). There is
no functionality change.

Changes in V2:
  - prepend vmx_ to get_instruction_length and update_guest_eip.
  - Do not export/use vmr().

Changes in V3:
  - Do not change emulate_forced_invalid_op() in this patch.

Changes in V7:
  - Drop pv_cpuid going public here.

Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/vmx/vmx.c         | 72 +++++++++++++++---------------------
 xen/arch/x86/hvm/vmx/vvmx.c        |  2 +-
 xen/include/asm-x86/hvm/vmx/vmcs.h |  1 +
 xen/include/asm-x86/hvm/vmx/vmx.h  | 16 +++++++-
 4 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 059d258..62cb84d 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -577,7 +577,7 @@ static int vmx_load_vmcs_ctxt(struct vcpu *v, struct hvm_hw_cpu *ctxt)
     return 0;
 }

-static void vmx_fpu_enter(struct vcpu *v)
+void vmx_fpu_enter(struct vcpu *v)
 {
     vcpu_restore_fpu_lazy(v);
     v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
@@ -1594,24 +1594,12 @@ const struct hvm_function_table * __init start_vmx(void)
     return &vmx_function_table;
 }

-/*
- * Not all cases receive valid value in the VM-exit instruction length field.
- * Callers must know what they're doing!
- */
-static int get_instruction_length(void)
-{
-    int len;
-    len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */
-    BUG_ON((len < 1) || (len > 15));
-    return len;
-}
-
-void update_guest_eip(void)
+void vmx_update_guest_eip(void)
 {
     struct cpu_user_regs *regs = guest_cpu_user_regs();
     unsigned long x;

-    regs->eip += get_instruction_length(); /* Safe: callers audited */
+    regs->eip += vmx_get_instruction_length(); /* Safe: callers audited */
     regs->eflags &= ~X86_EFLAGS_RF;

     x = __vmread(GUEST_INTERRUPTIBILITY_INFO);
@@ -1684,8 +1672,8 @@ static void vmx_do_cpuid(struct cpu_user_regs *regs)
     regs->edx = edx;
 }

-static void vmx_dr_access(unsigned long exit_qualification,
-                          struct cpu_user_regs *regs)
+void vmx_dr_access(unsigned long exit_qualification,
+                   struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;

@@ -2298,7 +2286,7 @@ static int vmx_handle_eoi_write(void)
     if ( (((exit_qualification >> 12) & 0xf) == 1) &&
          ((exit_qualification & 0xfff) == APIC_EOI) )
     {
-        update_guest_eip(); /* Safe: APIC data write */
+        vmx_update_guest_eip(); /* Safe: APIC data write */
         vlapic_EOI_set(vcpu_vlapic(current));
         HVMTRACE_0D(VLAPIC);
         return 1;
@@ -2511,7 +2499,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
             HVMTRACE_1D(TRAP, vector);
             if ( v->domain->debugger_attached )
             {
-                update_guest_eip(); /* Safe: INT3 */
+                vmx_update_guest_eip(); /* Safe: INT3 */
                 current->arch.gdbsx_vcpu_event = TRAP_int3;
                 domain_pause_for_debugger();
                 break;
@@ -2619,7 +2607,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
          */
         inst_len = ((source != 3) ||        /* CALL, IRET, or JMP? */
                     (idtv_info & (1u<<10))) /* IntrType > 3? */
-            ? get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0;
+            ? vmx_get_instruction_length() /* Safe: SDM 3B 23.2.4 */ : 0;
         if ( (source == 3) && (idtv_info & INTR_INFO_DELIVER_CODE_MASK) )
             ecode = __vmread(IDT_VECTORING_ERROR_CODE);
         regs->eip += inst_len;
         break;
     }
     case EXIT_REASON_CPUID:
-        update_guest_eip(); /* Safe: CPUID */
+        vmx_update_guest_eip(); /* Safe: CPUID */
         vmx_do_cpuid(regs);
         break;
     case EXIT_REASON_HLT:
-        update_guest_eip(); /* Safe: HLT */
+        vmx_update_guest_eip(); /* Safe: HLT */
         hvm_hlt(regs->eflags);
         break;
     case EXIT_REASON_INVLPG:
-        update_guest_eip(); /* Safe: INVLPG */
+        vmx_update_guest_eip(); /* Safe: INVLPG */
         exit_qualification = __vmread(EXIT_QUALIFICATION);
         vmx_invlpg_intercept(exit_qualification);
         break;
@@ -2643,7 +2631,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         regs->ecx = hvm_msr_tsc_aux(v);
         /* fall through */
     case EXIT_REASON_RDTSC:
-        update_guest_eip(); /* Safe: RDTSC, RDTSCP */
+        vmx_update_guest_eip(); /* Safe: RDTSC, RDTSCP */
         hvm_rdtsc_intercept(regs);
         break;
     case EXIT_REASON_VMCALL:
@@ -2653,7 +2641,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         rc = hvm_do_hypercall(regs);
         if ( rc != HVM_HCALL_preempted )
         {
-            update_guest_eip(); /* Safe: VMCALL */
+            vmx_update_guest_eip(); /* Safe: VMCALL */
             if ( rc == HVM_HCALL_invalidate )
                 send_invalidate_req();
         }
@@ -2663,7 +2651,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     {
         exit_qualification = __vmread(EXIT_QUALIFICATION);
         if ( vmx_cr_access(exit_qualification) == X86EMUL_OKAY )
-            update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */
+            vmx_update_guest_eip(); /* Safe: MOV Cn, LMSW, CLTS */
         break;
     }
     case EXIT_REASON_DR_ACCESS:
@@ -2677,7 +2665,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         {
             regs->eax = (uint32_t)msr_content;
             regs->edx = (uint32_t)(msr_content >> 32);
-            update_guest_eip(); /* Safe: RDMSR */
+            vmx_update_guest_eip(); /* Safe: RDMSR */
         }
         break;
     }
@@ -2686,63 +2674,63 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         uint64_t msr_content;
         msr_content = ((uint64_t)regs->edx << 32) | (uint32_t)regs->eax;
         if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY )
-            update_guest_eip(); /* Safe: WRMSR */
+            vmx_update_guest_eip(); /* Safe: WRMSR */
         break;
     }

     case EXIT_REASON_VMXOFF:
         if ( nvmx_handle_vmxoff(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMXON:
         if ( nvmx_handle_vmxon(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMCLEAR:
         if ( nvmx_handle_vmclear(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMPTRLD:
         if ( nvmx_handle_vmptrld(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMPTRST:
         if ( nvmx_handle_vmptrst(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMREAD:
         if ( nvmx_handle_vmread(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMWRITE:
         if ( nvmx_handle_vmwrite(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMLAUNCH:
         if ( nvmx_handle_vmlaunch(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_VMRESUME:
         if ( nvmx_handle_vmresume(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_INVEPT:
         if ( nvmx_handle_invept(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_INVVPID:
         if ( nvmx_handle_invvpid(regs) == X86EMUL_OKAY )
-            update_guest_eip();
+            vmx_update_guest_eip();
         break;

     case EXIT_REASON_MWAIT_INSTRUCTION:
@@ -2790,14 +2778,14 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         int bytes = (exit_qualification & 0x07) + 1;
         int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE;
         if ( handle_pio(port, bytes, dir) )
-            update_guest_eip(); /* Safe: IN, OUT */
+            vmx_update_guest_eip(); /* Safe: IN, OUT */
     }
         break;

     case EXIT_REASON_INVD:
     case EXIT_REASON_WBINVD:
     {
-        update_guest_eip(); /* Safe: INVD, WBINVD */
+        vmx_update_guest_eip(); /* Safe: INVD, WBINVD */
         vmx_wbinvd_intercept();
         break;
     }
@@ -2829,7 +2817,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     case EXIT_REASON_XSETBV:
         if ( hvm_handle_xsetbv(regs->ecx,
                                (regs->rdx << 32) | regs->_eax) == 0 )
-            update_guest_eip(); /* Safe: XSETBV */
+            vmx_update_guest_eip(); /* Safe: XSETBV */
         break;

     case EXIT_REASON_APIC_WRITE:
diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index bb7688f..225de9f 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -2136,7 +2136,7 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
             tsc += __get_vvmcs(nvcpu->nv_vvmcx, TSC_OFFSET);
             regs->eax = (uint32_t)tsc;
             regs->edx = (uint32_t)(tsc >> 32);
-            update_guest_eip();
+            vmx_update_guest_eip();

             return 1;
         }
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f30e5ac..c9d7118 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -475,6 +475,7 @@ void vmx_vmcs_switch(struct vmcs_struct *from, struct vmcs_struct *to);
 void vmx_set_eoi_exit_bitmap(struct vcpu *v, u8 vector);
 void vmx_clear_eoi_exit_bitmap(struct vcpu *v, u8 vector);
 int vmx_check_msr_bitmap(unsigned long *msr_bitmap, u32 msr, int access_type);
+void vmx_fpu_enter(struct vcpu *v);
 void virtual_vmcs_enter(void *vvmcs);
 void virtual_vmcs_exit(void *vvmcs);
 u64 virtual_vmcs_vmread(void *vvmcs, u32 vmcs_encoding);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index c33b9f9..ad341dc 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -446,6 +446,18 @@ static inline int __vmxon(u64 addr)
     return rc;
 }

+/*
+ * Not all cases receive valid value in the VM-exit instruction length field.
+ * Callers must know what they're doing!
+ */
+static inline int vmx_get_instruction_length(void)
+{
+    int len;
+    len = __vmread(VM_EXIT_INSTRUCTION_LEN); /* Safe: callers audited */
+    BUG_ON((len < 1) || (len > 15));
+    return len;
+}
+
 void vmx_get_segment_register(struct vcpu *, enum x86_segment,
                               struct segment_register *);
 void vmx_inject_extint(int trap);
@@ -457,7 +469,9 @@ void ept_p2m_uninit(struct p2m_domain *p2m);
 void ept_walk_table(struct domain *d, unsigned long gfn);
 void setup_ept_dump(void);

-void update_guest_eip(void);
+void vmx_update_guest_eip(void);
+void vmx_dr_access(unsigned long exit_qualification,
+                   struct cpu_user_regs *regs);

 int alloc_p2m_hap_data(struct p2m_domain *p2m);
 void free_p2m_hap_data(struct p2m_domain *p2m);
--
1.7.2.3
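[Illustration, not part of the patch: how a PVH exit handler (the vmx_pvh.c added later in the series) could use the now-exported helpers; pvh_handle_hlt() is a hypothetical function.]

    /* Sketch: advance rIP past the exiting HLT, then block like HVM does. */
    static void pvh_handle_hlt(struct cpu_user_regs *regs)
    {
        vmx_update_guest_eip();  /* HLT exits report a valid instruction length */
        hvm_hlt(regs->eflags);
    }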
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 05/18] PVH xen: hvm/vmcs related preparatory changes for PVH
In this patch, some common code is factored out to create
vmx_set_common_host_vmcs_fields() so it can be reused by PVH. Also, a few
changes are made in hvm.c, since hvm_domain.params is not set for PVH.

Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/hvm.c      | 10 ++++---
 xen/arch/x86/hvm/vmx/vmcs.c | 58 +++++++++++++++++++++++------------------
 2 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 43b6d05..118e21a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1070,10 +1070,13 @@ int hvm_vcpu_initialise(struct vcpu *v)
 {
     int rc;
     struct domain *d = v->domain;
-    domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+    domid_t dm_domid;

     hvm_asid_flush_vcpu(v);

+    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
+    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
+
     if ( (rc = vlapic_init(v)) != 0 )
         goto fail1;

@@ -1084,6 +1087,8 @@ int hvm_vcpu_initialise(struct vcpu *v)
          && (rc = nestedhvm_vcpu_initialise(v)) < 0 )
         goto fail3;

+    dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
+
     /* Create ioreq event channel. */
     rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
     if ( rc < 0 )
@@ -1106,9 +1111,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
         get_ioreq(v)->vp_eport = v->arch.hvm_vcpu.xen_port;
     spin_unlock(&d->arch.hvm_domain.ioreq.lock);

-    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
-    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
-
     v->arch.hvm_vcpu.inject_trap.vector = -1;

     rc = setup_compat_arg_xlat(v);
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index ef0ee7f..43539a6 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -825,11 +825,40 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
     virtual_vmcs_exit(vvmcs);
 }

-static int construct_vmcs(struct vcpu *v)
+static void vmx_set_common_host_vmcs_fields(struct vcpu *v)
 {
-    struct domain *d = v->domain;
     uint16_t sysenter_cs;
     unsigned long sysenter_eip;
+
+    /* Host data selectors. */
+    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
+    __vmwrite(HOST_FS_SELECTOR, 0);
+    __vmwrite(HOST_GS_SELECTOR, 0);
+    __vmwrite(HOST_FS_BASE, 0);
+    __vmwrite(HOST_GS_BASE, 0);
+
+    /* Host control registers. */
+    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+    __vmwrite(HOST_CR4,
+              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
+
+    /* Host CS:RIP. */
+    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
+    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
+
+    /* Host SYSENTER CS:RIP. */
+    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
+    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
+    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
+    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
+}
+
+static int construct_vmcs(struct vcpu *v)
+{
+    struct domain *d = v->domain;
     u32 vmexit_ctl = vmx_vmexit_control;
     u32 vmentry_ctl = vmx_vmentry_control;

@@ -932,30 +961,7 @@ static int construct_vmcs(struct vcpu *v)
         __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
     }

-    /* Host data selectors. */
-    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
-    __vmwrite(HOST_FS_SELECTOR, 0);
-    __vmwrite(HOST_GS_SELECTOR, 0);
-    __vmwrite(HOST_FS_BASE, 0);
-    __vmwrite(HOST_GS_BASE, 0);
-
-    /* Host control registers. */
-    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
-    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
-    __vmwrite(HOST_CR4,
-              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
-
-    /* Host CS:RIP. */
-    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
-    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
-
-    /* Host SYSENTER CS:RIP. */
-    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
-    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
-    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
-    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
+    vmx_set_common_host_vmcs_fields(v);

     /* MSR intercepts. */
     __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
--
1.7.2.3
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 06/18] PVH xen: Introduce PVH guest type and some basic changes.
This patch introduces the concept of a PVH guest. There are other basic
changes like creating macros to check for pv/pvh vcpu/domain, and also
modifying the copy-macros to account for PVH. Finally, guest_kernel_mode is
changed so that a PVH guest doesn't need to check the TF_kernel_mode flag,
since its kernel runs in ring 0.

Changes in V2:
  - Make is_pvh/is_hvm enum instead of adding is_pvh as a new flag.
  - Fix indentation and spacing in guest_kernel_mode macro.
  - Add debug-only BUG() in GUEST_KERNEL_RPL macro as it should no longer
    be called in any PVH paths.

Changes in V3:
  - Rename enum fields, and add is_pv to it.
  - Get rid of is_hvm_or_pvh_* macros.

Changes in V4:
  - Move e820 fields out of pv_domain struct.

Changes in V5:
  - Move the e820 changes of V4 to a separate patch.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/debug.c               |  2 +-
 xen/arch/x86/domain.c              |  7 +++++++
 xen/common/domain.c                |  2 +-
 xen/include/asm-x86/desc.h         |  6 ++++++
 xen/include/asm-x86/guest_access.h | 12 ++++++------
 xen/include/asm-x86/x86_64/regs.h  |  9 +++++----
 xen/include/public/domctl.h        |  3 +++
 xen/include/xen/sched.h            | 21 ++++++++++++++++++---
 8 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index e67473e..167421d 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -158,7 +158,7 @@ dbg_rw_guest_mem(dbgva_t addr, dbgbyte_t *buf, int len, struct domain *dp,

         pagecnt = min_t(long, PAGE_SIZE - (addr & ~PAGE_MASK), len);

-        mfn = (dp->is_hvm
+        mfn = (!is_pv_domain(dp)
                ? dbg_hvm_va2mfn(addr, dp, toaddr, &gfn)
                : dbg_pv_va2mfn(addr, dp, pgd3));
         if ( mfn == INVALID_MFN )
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 6c85c94..ee90b8a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -644,6 +644,13 @@ int arch_set_info_guest(
     unsigned int i;
     int rc = 0, compat;

+    /* This removed when all patches are checked in and PVH is done. */
+    if ( is_pvh_vcpu(v) )
+    {
+        printk("PVH: You don't have the correct xen version for PVH\n");
+        return -EINVAL;
+    }
+
     /* The context is a compat-mode one if the target domain is compat-mode;
      * we expect the tools to DTRT even in compat-mode callers. */
     compat = is_pv_32on64_domain(d);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index fac3470..6ece3fe 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -236,7 +236,7 @@ struct domain *domain_create(
         goto fail;

     if ( domcr_flags & DOMCRF_hvm )
-        d->is_hvm = 1;
+        d->guest_type = is_hvm;

     if ( domid == 0 )
     {
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 354b889..4eaa845 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -38,7 +38,13 @@

 #ifndef __ASSEMBLY__

+#ifndef NDEBUG
+/* PVH 32bitfixme : see emulate_gate_op call from do_general_protection */
+#define GUEST_KERNEL_RPL(d) (is_pvh_domain(d) ? ({ BUG(); 0; }) :  \
+                             is_pv_32bit_domain(d) ? 1 : 3)
+#else
 #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
+#endif

 /* Fix up the RPL of a guest segment selector. */
 #define __fixup_guest_selector(d, sel)                             \
diff --git a/xen/include/asm-x86/guest_access.h b/xen/include/asm-x86/guest_access.h
index ca700c9..675dda1 100644
--- a/xen/include/asm-x86/guest_access.h
+++ b/xen/include/asm-x86/guest_access.h
@@ -14,27 +14,27 @@

 /* Raw access functions: no type checking. */
 #define raw_copy_to_guest(dst, src, len)        \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      copy_to_user((dst), (src), (len)))
 #define raw_copy_from_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      copy_from_user((dst), (src), (len)))
 #define raw_clear_guest(dst,  len)              \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
 #define __raw_copy_to_guest(dst, src, len)      \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_to_user_hvm((dst), (src), (len)) :    \
      __copy_to_user((dst), (src), (len)))
 #define __raw_copy_from_guest(dst, src, len)    \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      copy_from_user_hvm((dst), (src), (len)) :  \
      __copy_from_user((dst), (src), (len)))
 #define __raw_clear_guest(dst,  len)            \
-    (is_hvm_vcpu(current) ?                     \
+    (!is_pv_vcpu(current) ?                     \
      clear_user_hvm((dst), (len)) :             \
      clear_user((dst), (len)))
diff --git a/xen/include/asm-x86/x86_64/regs.h b/xen/include/asm-x86/x86_64/regs.h
index 3cdc702..2ea49c5 100644
--- a/xen/include/asm-x86/x86_64/regs.h
+++ b/xen/include/asm-x86/x86_64/regs.h
@@ -10,10 +10,11 @@
 #define ring_2(r)    (((r)->cs & 3) == 2)
 #define ring_3(r)    (((r)->cs & 3) == 3)

-#define guest_kernel_mode(v, r)                                 \
-    (!is_pv_32bit_vcpu(v) ?                                     \
-     (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :        \
-     (ring_1(r)))
+#define guest_kernel_mode(v, r)                                   \
+    (is_pvh_vcpu(v) ? (ring_0(r)) :                               \
+     (!is_pv_32bit_vcpu(v) ?                                      \
+      (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :         \
+      (ring_1(r))))

 #define permit_softint(dpl, v, r) \
     ((dpl) >= (guest_kernel_mode(v, r) ? 1 : 3))
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 4c5b2bb..6b1aa11 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -89,6 +89,9 @@ struct xen_domctl_getdomaininfo {
  /* Being debugged.  */
 #define _XEN_DOMINF_debugged  6
 #define XEN_DOMINF_debugged   (1U<<_XEN_DOMINF_debugged)
+/* domain is PVH */
+#define _XEN_DOMINF_pvh_guest 7
+#define XEN_DOMINF_pvh_guest  (1U<<_XEN_DOMINF_pvh_guest)
 /* XEN_DOMINF_shutdown guest-supplied code.  */
 #define XEN_DOMINF_shutdownmask 255
 #define XEN_DOMINF_shutdownshift 16
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ae6a3b8..4d5edf5 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -238,6 +238,14 @@ struct mem_event_per_domain
     struct mem_event_domain access;
 };

+/*
+ * PVH is a PV guest running in an HVM container. While is_hvm_* checks are
+ * false for it, it uses many of the HVM data structs.
+ */
+enum guest_type {
+    is_pv, is_pvh, is_hvm
+};
+
 struct domain
 {
     domid_t          domain_id;
@@ -285,8 +293,8 @@ struct domain
     struct rangeset *iomem_caps;
     struct rangeset *irq_caps;

-    /* Is this an HVM guest? */
-    bool_t           is_hvm;
+    enum guest_type  guest_type;
+
 #ifdef HAS_PASSTHROUGH
     /* Does this guest need iommu mappings? */
     bool_t           need_iommu;
@@ -464,6 +472,9 @@ struct domain *domain_create(
  /* DOMCRF_oos_off: dont use out-of-sync optimization for shadow page tables */
 #define _DOMCRF_oos_off         4
 #define DOMCRF_oos_off          (1U<<_DOMCRF_oos_off)
+ /* DOMCRF_pvh: Create PV domain in HVM container. */
+#define _DOMCRF_pvh             5
+#define DOMCRF_pvh              (1U<<_DOMCRF_pvh)

 /*
  * rcu_lock_domain_by_id() is more efficient than get_domain_by_id().
@@ -732,8 +743,12 @@ void watchdog_domain_destroy(struct domain *d);

 #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist))

-#define is_hvm_domain(d) ((d)->is_hvm)
+#define is_pv_domain(d) ((d)->guest_type == is_pv)
+#define is_pv_vcpu(v)   (is_pv_domain(v->domain))
+#define is_hvm_domain(d) ((d)->guest_type == is_hvm)
 #define is_hvm_vcpu(v)   (is_hvm_domain(v->domain))
+#define is_pvh_domain(d) ((d)->guest_type == is_pvh)
+#define is_pvh_vcpu(v)   (is_pvh_domain(v->domain))
 #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
                            cpumask_weight((v)->cpu_affinity) == 1)
 #ifdef HAS_PASSTHROUGH
--
1.7.2.3
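[Illustration, not part of the patch: the new predicates make the three-way guest type explicit and mutually exclusive; guest_type_str() is a hypothetical helper.]

    /* Sketch: exactly one of the three predicates holds for any domain. */
    static const char *guest_type_str(const struct domain *d)
    {
        if ( is_pv_domain(d) )
            return "PV";
        if ( is_pvh_domain(d) )
            return "PVH";
        ASSERT(is_hvm_domain(d));
        return "HVM";
    }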
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 07/18] PVH xen: domain create, schedular related code changes
This patch mostly contains changes to arch/x86/domain.c to allow for PVH
domain creation. A new function, hvm_set_vcpu_info(), is introduced to set
some guest context in the VMCS. The target vmx function,
vmx_pvh_set_vcpu_info(), is introduced later as part of the vmx_pvh.c file.
Lastly, this patch also changes the context-switch code in the same file to
follow HVM behaviour for PVH.

Changes in V2:
  - changes to read_segment_register() moved to this patch.
  - The other comment was to create NULL functions for pvh_set_vcpu_info and
    pvh_read_descriptor, which are implemented in a later patch; since I
    disable PVH creation until all patches are checked in, that is not
    needed, but it helps the breaking down of patches.

Changes in V3:
  - Fix read_segment_register() macro to make sure args are evaluated once,
    and use # instead of STR for name in the macro.

Changes in V4:
  - Remove pvh substruct in the hvm substruct, as the vcpu_info_mfn has been
    moved out of pv_vcpu struct.
  - Rename hvm_pvh_* functions to hvm_*.

Changes in V5:
  - Remove pvh_read_descriptor().

Changes in V7:
  - Remove hap_update_cr3() and read_segment_register changes from here.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c         | 61 +++++++++++++++++++++++++++-------------
 xen/arch/x86/mm.c             |  3 ++
 xen/include/asm-x86/hvm/hvm.h |  8 +++++
 3 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ee90b8a..04ef0e5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -385,7 +385,7 @@ int vcpu_initialise(struct vcpu *v)

     vmce_init_vcpu(v);

-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
         goto done;
@@ -452,7 +452,7 @@ void vcpu_destroy(struct vcpu *v)

     vcpu_destroy_fpu(v);

-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
         hvm_vcpu_destroy(v);
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
@@ -464,7 +464,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     int rc = -ENOMEM;

     d->arch.hvm_domain.hap_enabled =
-        is_hvm_domain(d) &&
+        !is_pv_domain(d) &&
         hvm_funcs.hap_supported &&
         (domcr_flags & DOMCRF_hap);
     d->arch.hvm_domain.mem_sharing_enabled = 0;
@@ -512,7 +512,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     mapcache_domain_init(d);

     HYPERVISOR_COMPAT_VIRT_START(d) =
-        is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
+        is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;

     if ( (rc = paging_domain_init(d, domcr_flags)) != 0 )
         goto fail;
@@ -555,7 +555,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     }
     spin_lock_init(&d->arch.e820_lock);

-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
     {
         if ( (rc = hvm_domain_initialise(d)) != 0 )
         {
@@ -658,7 +658,7 @@ int arch_set_info_guest(
 #define c(fld) (compat ? (c.cmp->fld) : (c.nat->fld))
     flags = c(flags);

-    if ( !is_hvm_vcpu(v) )
+    if ( is_pv_vcpu(v) )
     {
         if ( !compat )
         {
@@ -711,7 +711,7 @@ int arch_set_info_guest(
     v->fpu_initialised = !!(flags & VGCF_I387_VALID);

     v->arch.flags &= ~TF_kernel_mode;
-    if ( (flags & VGCF_in_kernel) || is_hvm_vcpu(v)/*???*/ )
+    if ( (flags & VGCF_in_kernel) || !is_pv_vcpu(v)/*???*/ )
         v->arch.flags |= TF_kernel_mode;

     v->arch.vgc_flags = flags;
@@ -722,7 +722,7 @@ int arch_set_info_guest(
     if ( !compat )
     {
         memcpy(&v->arch.user_regs, &c.nat->user_regs, sizeof(c.nat->user_regs));
-        if ( !is_hvm_vcpu(v) )
+        if ( is_pv_vcpu(v) )
             memcpy(v->arch.pv_vcpu.trap_ctxt, c.nat->trap_ctxt,
                    sizeof(c.nat->trap_ctxt));
     }
@@ -738,10 +738,13 @@ int arch_set_info_guest(

     v->arch.user_regs.eflags |= 2;

-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
     {
         hvm_set_info_guest(v);
-        goto out;
+        if ( is_hvm_vcpu(v) || v->is_initialised )
+            goto out;
+        else
+            goto pvh_skip_pv_stuff;
     }

     init_int80_direct_trap(v);
@@ -750,7 +753,10 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.iopl = (v->arch.user_regs.eflags >> 12) & 3;
     v->arch.user_regs.eflags &= ~X86_EFLAGS_IOPL;

-    /* Ensure real hardware interrupts are enabled. */
+    /*
+     * Ensure real hardware interrupts are enabled. Note: PVH may not have
+     * IDT set on all vcpus so we don't enable IF for it yet.
+     */
     v->arch.user_regs.eflags |= X86_EFLAGS_IF;

     if ( !v->is_initialised )
@@ -852,6 +858,7 @@ int arch_set_info_guest(

     set_bit(_VPF_in_reset, &v->pause_flags);

+pvh_skip_pv_stuff:
     if ( !compat )
         cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]);
     else
@@ -860,7 +867,7 @@ int arch_set_info_guest(

     if ( !cr3_page )
         rc = -EINVAL;
-    else if ( paging_mode_refcounts(d) )
+    else if ( paging_mode_refcounts(d) || is_pvh_vcpu(v) )
         /* nothing */;
     else if ( cr3_page == v->arch.old_guest_table )
     {
@@ -886,8 +893,15 @@ int arch_set_info_guest(
             /* handled below */;
         else if ( !compat )
         {
+            /* PVH 32bitfixme. */
+            if ( is_pvh_vcpu(v) )
+            {
+                v->arch.cr3 = page_to_mfn(cr3_page);
+                v->arch.hvm_vcpu.guest_cr[3] = c.nat->ctrlreg[3];
+            }
+
             v->arch.guest_table = pagetable_from_page(cr3_page);
-            if ( c.nat->ctrlreg[1] )
+            if ( c.nat->ctrlreg[1] && !is_pvh_vcpu(v) )
             {
                 cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]);
                 cr3_page = get_page_from_gfn(d, cr3_gfn, NULL, P2M_ALLOC);
@@ -942,6 +956,13 @@ int arch_set_info_guest(

     update_cr3(v);

+    if ( is_pvh_vcpu(v) )
+    {
+        /* Guest is bringing up non-boot SMP vcpu. */
+        if ( (rc = hvm_set_vcpu_info(v, c.nat)) != 0 )
+            return rc;
+    }
+
  out:
     if ( flags & VGCF_online )
         clear_bit(_VPF_down, &v->pause_flags);
@@ -1303,7 +1324,7 @@ static void update_runstate_area(struct vcpu *v)

 static inline int need_full_gdt(struct vcpu *v)
 {
-    return (!is_hvm_vcpu(v) && !is_idle_vcpu(v));
+    return (is_pv_vcpu(v) && !is_idle_vcpu(v));
 }

 static void __context_switch(void)
@@ -1438,7 +1459,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         /* Re-enable interrupts before restoring state which may fault. */
         local_irq_enable();

-        if ( !is_hvm_vcpu(next) )
+        if ( is_pv_vcpu(next) )
         {
             load_LDT(next);
             load_segments(next);
@@ -1564,12 +1585,12 @@ unsigned long hypercall_create_continuation(
         regs->eax  = op;

         /* Ensure the hypercall trap instruction is re-executed. */
-        if ( !is_hvm_vcpu(current) )
+        if ( is_pv_vcpu(current) )
             regs->eip -= 2;  /* re-execute 'syscall' / 'int $xx' */
         else
             current->arch.hvm_vcpu.hcall_preempted = 1;

-        if ( !is_hvm_vcpu(current) ?
+        if ( is_pv_vcpu(current) ?
              !is_pv_32on64_vcpu(current) :
              (hvm_guest_x86_mode(current) == 8) )
         {
@@ -1837,7 +1858,7 @@ int domain_relinquish_resources(struct domain *d)
                 return ret;
         }

-    if ( !is_hvm_domain(d) )
+    if ( is_pv_domain(d) )
     {
         for_each_vcpu ( d, v )
         {
@@ -1910,7 +1931,7 @@ int domain_relinquish_resources(struct domain *d)
         BUG();
     }

-    if ( is_hvm_domain(d) )
+    if ( !is_pv_domain(d) )
         hvm_domain_relinquish_resources(d);

     return 0;
@@ -1994,7 +2015,7 @@ void vcpu_mark_events_pending(struct vcpu *v)
     if ( already_pending )
         return;

-    if ( is_hvm_vcpu(v) )
+    if ( !is_pv_vcpu(v) )
         hvm_assert_evtchn_irq(v);
     else
         vcpu_kick(v);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9f58968..eb535a3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4330,6 +4330,9 @@ void destroy_gdt(struct vcpu *v)
     int i;
     unsigned long pfn;

+    if ( is_pvh_vcpu(v) )
+        return;
+
     v->arch.pv_vcpu.gdt_ents = 0;
     pl1e = gdt_ldt_ptes(v->domain, v);
     for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 8408420..61dc857 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -192,6 +192,8 @@ struct hvm_function_table {
                                 paddr_t *L1_gpa, unsigned int *page_order,
                                 uint8_t *p2m_acc, bool_t access_r,
                                 bool_t access_w, bool_t access_x);
+    /* PVH functions. */
+    int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp);
 };

 extern struct hvm_function_table hvm_funcs;
@@ -325,6 +327,12 @@ static inline unsigned long hvm_get_shadow_gs_base(struct vcpu *v)
     return hvm_funcs.get_shadow_gs_base(v);
 }

+static inline int hvm_set_vcpu_info(struct vcpu *v,
+                                    struct vcpu_guest_context *ctxtp)
+{
+    return hvm_funcs.pvh_set_vcpu_info(v, ctxtp);
+}
+
 #define is_viridian_domain(_d)                                             \
    (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN]))
--
1.7.2.3
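[Illustration, not part of the patch: a possible shape of the VMX backend that a later patch in this series supplies as vmx_pvh_set_vcpu_info(); the specific VMCS fields written below are an assumption based on the gdt.pvh layout from patch 01, not the actual later code.]

    /* Sketch: load bring-up state for a non-boot PVH vcpu into its VMCS. */
    static int vmx_pvh_set_vcpu_info(struct vcpu *v,
                                     struct vcpu_guest_context *ctxtp)
    {
        vmx_vmcs_enter(v);
        __vmwrite(GUEST_GDTR_BASE,  ctxtp->gdt.pvh.addr);
        __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
        /* ... selectors, rip/rsp/rflags etc. from ctxtp->user_regs ... */
        vmx_vmcs_exit(v);
        return 0;
    }

The hook would then be wired up via the new .pvh_set_vcpu_info member of vmx_function_table, so that hvm_set_vcpu_info() reaches it.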
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 08/18] PVH xen: support invalid op emulation for PVH
This patch supports invalid op emulation for PVH by calling the appropriate
copy macros and the HVM function to inject a page fault.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/traps.c            | 16 +++++++++++++---
 xen/include/asm-x86/processor.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 378ef0a..d29136d 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -459,6 +459,10 @@ static void instruction_done(
     struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
 {
     regs->eip = eip;
+
+    if ( is_pvh_vcpu(current) )
+        return;
+
     regs->eflags &= ~X86_EFLAGS_RF;
     if ( bpmatch || (regs->eflags & X86_EFLAGS_TF) )
     {
@@ -913,7 +917,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
     return EXCRET_fault_fixed;
 }

-static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
+int emulate_forced_invalid_op(struct cpu_user_regs *regs)
 {
     char sig[5], instr[2];
     unsigned long eip, rc;
@@ -921,7 +925,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
     eip = regs->eip;

     /* Check for forced emulation signature: ud2 ; .ascii "xen". */
-    if ( (rc = copy_from_user(sig, (char *)eip, sizeof(sig))) != 0 )
+    if ( (rc = raw_copy_from_guest(sig, (char *)eip, sizeof(sig))) != 0 )
     {
         propagate_page_fault(eip + sizeof(sig) - rc, 0);
         return EXCRET_fault_fixed;
@@ -931,7 +935,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
     eip += sizeof(sig);

     /* We only emulate CPUID. */
-    if ( ( rc = copy_from_user(instr, (char *)eip, sizeof(instr))) != 0 )
+    if ( ( rc = raw_copy_from_guest(instr, (char *)eip, sizeof(instr))) != 0 )
     {
         propagate_page_fault(eip + sizeof(instr) - rc, 0);
         return EXCRET_fault_fixed;
@@ -1076,6 +1080,12 @@ void propagate_page_fault(unsigned long addr, u16 error_code)
     struct vcpu *v = current;
     struct trap_bounce *tb = &v->arch.pv_vcpu.trap_bounce;

+    if ( is_pvh_vcpu(v) )
+    {
+        hvm_inject_page_fault(error_code, addr);
+        return;
+    }
+
     v->arch.pv_vcpu.ctrlreg[2] = addr;
     arch_set_cr2(v, addr);
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 5cdacc7..9acd9ea 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -566,6 +566,7 @@ void microcode_set_module(unsigned int);
 int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
 int microcode_resume_cpu(int cpu);

+int emulate_forced_invalid_op(struct cpu_user_regs *regs);
 #endif /* !__ASSEMBLY__ */

 #endif /* __ASM_X86_PROCESSOR_H */
--
1.7.2.3
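[Illustration, not part of the patch: the guest-side sequence this path emulates. The forced-emulation signature is ud2 (0x0f 0x0b) followed by the ASCII bytes "xen", then the cpuid being emulated; this mirrors the XEN_EMULATE_PREFIX used by Linux PV kernels.]

    /* Sketch: a guest issuing a forced-emulation CPUID. */
    static inline void xen_cpuid(uint32_t *eax, uint32_t *ebx,
                                 uint32_t *ecx, uint32_t *edx)
    {
        asm volatile ( ".byte 0x0f,0x0b,0x78,0x65,0x6e; cpuid"
                       : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                       : "0" (*eax), "2" (*ecx) );
    }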
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 09/18] PVH xen: Support privileged op emulation for PVH
This patch mostly changes traps.c to support privileged-op emulation for PVH.
A new function, read_descriptor_sel(), is introduced to read descriptors for
PVH. Also, the read_segment_register macro is modified to return the selector
from cpu_user_regs instead of reading it natively.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/traps.c         | 86 ++++++++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/system.h | 18 +++++++--
 2 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d29136d..0caf73a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -479,6 +479,10 @@ static unsigned int check_guest_io_breakpoint(struct vcpu *v,
     unsigned int width, i, match = 0;
     unsigned long start;

+    /* PVH fixme: support io breakpoint. */
+    if ( is_pvh_vcpu(v) )
+        return 0;
+
     if ( !(v->arch.debugreg[5]) ||
          !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_DE) )
         return 0;
@@ -1524,6 +1528,49 @@ static int read_descriptor(unsigned int sel,
     return 1;
 }

+static int read_descriptor_sel(unsigned int sel,
+                               enum x86_segment which_sel,
+                               struct vcpu *v,
+                               const struct cpu_user_regs *regs,
+                               unsigned long *base,
+                               unsigned long *limit,
+                               unsigned int *ar,
+                               unsigned int vm86attr)
+{
+    struct segment_register seg;
+    unsigned int long_mode = 0;
+
+    if ( !is_pvh_vcpu(v) )
+        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
+
+    hvm_get_segment_register(v, x86_seg_cs, &seg);
+    long_mode = seg.attr.fields.l;
+
+    if ( which_sel != x86_seg_cs )
+        hvm_get_segment_register(v, which_sel, &seg);
+
+    /* "ar" is returned packed as in segment_attributes_t. Fix it up. */
+    *ar = (unsigned int)seg.attr.bytes;
+    *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
+    *ar = *ar << 8;
+
+    if ( long_mode )
+    {
+        *limit = ~0UL;
+
+        if ( which_sel < x86_seg_fs )
+        {
+            *base = 0UL;
+            return 1;
+        }
+    }
+    else
+        *limit = (unsigned long)seg.limit;
+
+    *base = seg.base;
+    return 1;
+}
+
 static int read_gate_descriptor(unsigned int gate_sel,
                                 const struct vcpu *v,
                                 unsigned int *sel,
@@ -1589,6 +1636,13 @@ static int guest_io_okay(
     int user_mode = !(v->arch.flags & TF_kernel_mode);
 #define TOGGLE_MODE() if ( user_mode ) toggle_guest_mode(v)

+    /*
+     * For PVH we check this in vmexit for EXIT_REASON_IO_INSTRUCTION
+     * and so don't need to check again here.
+     */
+    if ( is_pvh_vcpu(v) )
+        return 1;
+
     if ( !vm86_mode(regs) &&
          (v->arch.pv_vcpu.iopl >= (guest_kernel_mode(v, regs) ? 1 : 3)) )
         return 1;
@@ -1834,7 +1888,7 @@ static inline uint64_t guest_misc_enable(uint64_t val)
     _ptr = (unsigned int)_ptr;                                              \
     if ( (limit) < sizeof(_x) - 1 || (eip) > (limit) - (sizeof(_x) - 1) )   \
         goto fail;                                                          \
-    if ( (_rc = copy_from_user(&_x, (type *)_ptr, sizeof(_x))) != 0 )       \
+    if ( (_rc = raw_copy_from_guest(&_x, (type *)_ptr, sizeof(_x))) != 0 )  \
     {                                                                       \
         propagate_page_fault(_ptr + sizeof(_x) - _rc, 0);                   \
         goto skip;                                                          \
@@ -1851,6 +1905,7 @@ static int is_cpufreq_controller(struct domain *d)

 static int emulate_privileged_op(struct cpu_user_regs *regs)
 {
+    enum x86_segment which_sel;
     struct vcpu *v = current;
     unsigned long *reg, eip = regs->eip;
     u8 opcode, modrm_reg = 0, modrm_rm = 0, rep_prefix = 0, lock = 0, rex = 0;
@@ -1873,9 +1928,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
     void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1)));
     uint64_t val, msr_content;

-    if ( !read_descriptor(regs->cs, v, regs,
-                          &code_base, &code_limit, &ar,
-                          _SEGMENT_CODE|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P) )
+    if ( !read_descriptor_sel(regs->cs, x86_seg_cs, v, regs,
+                              &code_base, &code_limit, &ar,
+                              _SEGMENT_CODE|_SEGMENT_S|
+                              _SEGMENT_DPL|_SEGMENT_P) )
         goto fail;
     op_default = op_bytes = (ar & (_SEGMENT_L|_SEGMENT_DB)) ? 4 : 2;
     ad_default = ad_bytes = (ar & _SEGMENT_L) ? 8 : op_default;
@@ -1886,6 +1942,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)

     /* emulating only opcodes not allowing SS to be default */
     data_sel = read_segment_register(v, regs, ds);
+    which_sel = x86_seg_ds;

     /* Legacy prefixes. */
     for ( i = 0; i < 8; i++, rex == opcode || (rex = 0) )
@@ -1901,23 +1958,29 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
             continue;
         case 0x2e: /* CS override */
             data_sel = regs->cs;
+            which_sel = x86_seg_cs;
             continue;
         case 0x3e: /* DS override */
             data_sel = read_segment_register(v, regs, ds);
+            which_sel = x86_seg_ds;
             continue;
         case 0x26: /* ES override */
             data_sel = read_segment_register(v, regs, es);
+            which_sel = x86_seg_es;
             continue;
         case 0x64: /* FS override */
             data_sel = read_segment_register(v, regs, fs);
+            which_sel = x86_seg_fs;
             lm_ovr = lm_seg_fs;
             continue;
         case 0x65: /* GS override */
             data_sel = read_segment_register(v, regs, gs);
+            which_sel = x86_seg_gs;
             lm_ovr = lm_seg_gs;
             continue;
         case 0x36: /* SS override */
             data_sel = regs->ss;
+            which_sel = x86_seg_ss;
             continue;
         case 0xf0: /* LOCK */
             lock = 1;
@@ -1961,15 +2024,16 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)

         if ( !(opcode & 2) )
         {
             data_sel = read_segment_register(v, regs, es);
+            which_sel = x86_seg_es;
             lm_ovr = lm_seg_none;
         }

         if ( !(ar & _SEGMENT_L) )
         {
-            if ( !read_descriptor(data_sel, v, regs,
-                                  &data_base, &data_limit, &ar,
-                                  _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
-                                  _SEGMENT_P) )
+            if ( !read_descriptor_sel(data_sel, which_sel, v, regs,
+                                      &data_base, &data_limit, &ar,
+                                      _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|
+                                      _SEGMENT_P) )
                 goto fail;
             if ( !(ar & _SEGMENT_S) ||
                  !(ar & _SEGMENT_P) ||
@@ -1999,9 +2063,9 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
             }
         }
         else
-            read_descriptor(data_sel, v, regs,
-                            &data_base, &data_limit, &ar,
-                            0);
+            read_descriptor_sel(data_sel, which_sel, v, regs,
+                                &data_base, &data_limit, &ar,
+                                0);
         data_limit = ~0UL;
         ar = _SEGMENT_WR|_SEGMENT_S|_SEGMENT_DPL|_SEGMENT_P;
     }
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 9bb22cb..e29b6a3 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -4,10 +4,20 @@
 #include <xen/lib.h>
 #include <xen/bitops.h>

-#define read_segment_register(vcpu, regs, name)                 \
-({  u16 __sel;                                                  \
-    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
-    __sel;                                                      \
+/*
+ * We need vcpu because during context switch, going from PVH to PV,
+ * in save_segments(), current has been updated to next, and no longer
+ * pointing to the PVH.
+ */
+#define read_segment_register(vcpu, regs, name)                   \
+({  u16 __sel;                                                    \
+    struct cpu_user_regs *_regs = (regs);                         \
+                                                                  \
+    if ( is_pvh_vcpu(vcpu) && guest_mode(regs) )                  \
+        __sel = _regs->name;                                      \
+    else                                                          \
+        asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) );    \
+    __sel;                                                        \
 })

 #define wbinvd() \
--
1.7.2.3
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 10/18] PVH xen: interrupt/event-channel delivery to PVH
PVH uses HVMIRQ_callback_vector for interrupt delivery. Also, change hvm_vcpu_has_pending_irq() as PVH doesn't use vlapic emulation.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/irq.c       |    3 +++
 xen/arch/x86/hvm/vmx/intr.c  |    8 ++++++--
 xen/include/asm-x86/domain.h |    2 +-
 xen/include/asm-x86/event.h  |    2 +-
 4 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 9eae5de..92fb245 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
          && vcpu_info(v, evtchn_upcall_pending) )
         return hvm_intack_vector(plat->irq.callback_via.vector);
 
+    if ( is_pvh_vcpu(v) )
+        return hvm_intack_none;
+
     if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
         return hvm_intack_pic(0);
 
diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
index e376f3c..ce42950 100644
--- a/xen/arch/x86/hvm/vmx/intr.c
+++ b/xen/arch/x86/hvm/vmx/intr.c
@@ -165,6 +165,9 @@ static int nvmx_intr_intercept(struct vcpu *v, struct hvm_intack intack)
     {
         u32 ctrl;
 
+        if ( is_pvh_vcpu(v) )
+            return 0;
+
         if ( nvmx_intr_blocked(v) != hvm_intblk_none )
         {
             enable_intr_window(v, intack);
@@ -219,8 +222,9 @@ void vmx_intr_assist(void)
         return;
     }
 
-    /* Crank the handle on interrupt state. */
-    pt_vector = pt_update_irq(v);
+    if ( !is_pvh_vcpu(v) )
+        /* Crank the handle on interrupt state. */
+        pt_vector = pt_update_irq(v);
 
     do {
         intack = hvm_vcpu_has_pending_irq(v);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index c3f9f8e..b95314a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -16,7 +16,7 @@
 #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d))
 #define is_pv_32on64_vcpu(v)   (is_pv_32on64_domain((v)->domain))
 
-#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \
+#define is_hvm_pv_evtchn_domain(d) (!is_pv_domain(d) && \
         d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector)
 #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain))
 
diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index 06057c7..7ed5812 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -18,7 +18,7 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
-    return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) :
+    return (!is_pv_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
 }
-- 
1.7.2.3
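For context, a guest-side sketch (not part of this patch) of how a PVH
kernel would select HVMIRQ_callback_vector; the (2ULL << 56) encoding picks
the "vector" delivery type per public/hvm/params.h, and the hypercall
wrapper name and chosen vector are assumptions:

    /* Sketch: ask Xen to deliver event-channel upcalls through a fixed
     * IDT vector.  CALLBACK_VECTOR is an assumed guest-chosen vector. */
    #define CALLBACK_VECTOR 0xf3

    static int register_callback_vector(void)
    {
        struct xen_hvm_param hp = {
            .domid = DOMID_SELF,
            .index = HVM_PARAM_CALLBACK_IRQ,
            .value = (2ULL << 56) | CALLBACK_VECTOR, /* type 2 == vector */
        };

        return HYPERVISOR_hvm_op(HVMOP_set_param, &hp);
    }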
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 11/18] PVH xen: additional changes to support PVH guest creation and execution.
Fail creation of 32bit PVH guests. Change hap_paging_get_mode() to return long mode for PVH; this path is exercised during domain creation from arch_set_info_guest(). Return the correct feature set to a PVH guest during its boot.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain.c     |    8 ++++++++
 xen/arch/x86/mm/hap/hap.c |    4 +++-
 xen/common/domain.c       |    9 +++++++++
 xen/common/domctl.c       |    5 +++++
 xen/common/kernel.c       |    6 +++++-
 5 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 04ef0e5..3734d9d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -339,6 +339,14 @@ int switch_compat(struct domain *d)
 
     if ( d == NULL )
         return -EINVAL;
+
+    if ( is_pvh_domain(d) )
+    {
+        gdprintk(XENLOG_G_ERR,
+                 "Xen does not currently support 32bit PVH guests\n");
+        return -EINVAL;
+    }
+
     if ( !may_switch_mode(d) )
         return -EACCES;
     if ( is_pv_32on64_domain(d) )
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index bff05d9..19a085c 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -639,7 +639,9 @@ static void hap_update_cr3(struct vcpu *v, int do_locking)
 const struct paging_mode *
 hap_paging_get_mode(struct vcpu *v)
 {
-    return !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
+    /* PVH 32bitfixme. */
+    return is_pvh_vcpu(v) ? &hap_paging_long_mode :
+           !hvm_paging_enabled(v)   ? &hap_paging_real_mode :
            hvm_long_mode_enabled(v) ? &hap_paging_long_mode :
            hvm_pae_enabled(v)       ? &hap_paging_pae_mode  :
                                       &hap_paging_protected_mode;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 6ece3fe..b4be781 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -237,6 +237,15 @@ struct domain *domain_create(
 
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = is_hvm;
+    else if ( domcr_flags & DOMCRF_pvh )
+    {
+        if ( !(domcr_flags & DOMCRF_hap) )
+        {
+            printk(XENLOG_INFO "PVH guest must have HAP on\n");
+            goto fail;
+        }
+        d->guest_type = is_pvh;
+    }
 
     if ( domid == 0 )
     {
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 9bd8f80..f9c361d 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -187,6 +187,8 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
 
     if ( is_hvm_domain(d) )
         info->flags |= XEN_DOMINF_hvm_guest;
+    else if ( is_pvh_domain(d) )
+        info->flags |= XEN_DOMINF_pvh_guest;
 
     xsm_security_domaininfo(d, info);
 
@@ -443,6 +445,9 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         domcr_flags = 0;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hvm_guest )
             domcr_flags |= DOMCRF_hvm;
+        else if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
+            domcr_flags |= DOMCRF_pvh;  /* PV with HAP is a PVH guest */
+
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_hap )
             domcr_flags |= DOMCRF_hap;
         if ( op->u.createdomain.flags & XEN_DOMCTL_CDF_s3_integrity )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 72fb905..3bba758 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -289,7 +289,11 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             if ( current->domain == dom0 )
                 fi.submap |= 1U << XENFEAT_dom0;
 #ifdef CONFIG_X86
-            if ( !is_hvm_vcpu(current) )
+            if ( is_pvh_vcpu(current) )
+                fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
+                             (1U << XENFEAT_supervisor_mode_kernel) |
+                             (1U << XENFEAT_hvm_callback_vector);
+            else if ( !is_hvm_vcpu(current) )
                 fi.submap |= (1U << XENFEAT_mmu_pt_update_preserve_ad) |
                              (1U << XENFEAT_highmem_assist) |
                              (1U << XENFEAT_gnttab_map_avail_bits);
-- 
1.7.2.3
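As a usage note, a guest-side sketch of how the submap advertised above
would be consumed at boot; the hypercall wrapper name is an assumption,
xen_feature_info is from the public version.h interface:

    /* Sketch: probe for XENFEAT_hvm_callback_vector early in boot. */
    static int have_callback_vector(void)
    {
        struct xen_feature_info fi = { .submap_idx = 0 };

        if ( HYPERVISOR_xen_version(XENVER_get_features, &fi) < 0 )
            return 0;

        return !!(fi.submap & (1U << XENFEAT_hvm_callback_vector));
    }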
PVH doesn't use the map cache. show_registers() for PVH takes the HVM path.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/domain_page.c  |   10 +++++-----
 xen/arch/x86/x86_64/traps.c |    6 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 9297ea0..5092fdb 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -34,7 +34,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
      * then it means we are running on the idle domain's page table and must
      * therefore use its mapcache.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
     {
         /* If we really are idling, perform lazy context switch now. */
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
@@ -71,7 +71,7 @@ void *map_domain_page(unsigned long mfn)
 #endif
 
     v = mapcache_current_vcpu();
-    if ( !v || is_hvm_vcpu(v) )
+    if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn);
 
     dcache = &v->domain->arch.pv_domain.mapcache;
@@ -176,7 +176,7 @@ void unmap_domain_page(const void *ptr)
     ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
 
     v = mapcache_current_vcpu();
-    ASSERT(v && !is_hvm_vcpu(v));
+    ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv_domain.mapcache;
     ASSERT(dcache->inuse);
@@ -243,7 +243,7 @@ int mapcache_domain_init(struct domain *d)
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
     unsigned int bitmap_pages;
 
-    if ( is_hvm_domain(d) || is_idle_domain(d) )
+    if ( !is_pv_domain(d) || is_idle_domain(d) )
         return 0;
 
 #ifdef NDEBUG
@@ -274,7 +274,7 @@ int mapcache_vcpu_init(struct vcpu *v)
     unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( is_hvm_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) || !dcache->inuse )
         return 0;
 
     if ( ents > dcache->entries )
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index d2f7209..bcfd740 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -85,7 +85,7 @@ void show_registers(struct cpu_user_regs *regs)
     enum context context;
     struct vcpu *v = current;
 
-    if ( is_hvm_vcpu(v) && guest_mode(regs) )
+    if ( !is_pv_vcpu(v) && guest_mode(regs) )
     {
         struct segment_register sreg;
         context = CTXT_hvm_guest;
@@ -146,8 +146,8 @@ void vcpu_show_registers(const struct vcpu *v)
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
 
-    /* No need to handle HVM for now. */
-    if ( is_hvm_vcpu(v) )
+    /* No need to handle HVM and PVH for now. */
+    if ( !is_pv_vcpu(v) )
         return;
 
     crs[0] = v->arch.pv_vcpu.ctrlreg[0];
-- 
1.7.2.3
PVH only supports limited memory types in Phase I. TSC is likewise limited to native mode for the moment. Finally, grant mapping of iomem for PVH has not been explored in Phase I.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/mtrr.c  |    8 ++++++++
 xen/arch/x86/time.c      |    8 ++++++++
 xen/common/grant_table.c |    4 ++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index ef51a8d..b9d6411 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -693,6 +693,14 @@ uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
          ((d->vcpu == NULL) || ((v = d->vcpu[0]) == NULL)) )
         return MTRR_TYPE_WRBACK;
 
+    /* PVH fixme: Add support for more memory types. */
+    if ( is_pvh_domain(d) )
+    {
+        if ( direct_mmio )
+            return MTRR_TYPE_UNCACHABLE;
+        return MTRR_TYPE_WRBACK;
+    }
+
     if ( !v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] )
         return MTRR_TYPE_WRBACK;
 
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 86640f5..5b1b6bb 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1893,6 +1893,14 @@ void tsc_set_info(struct domain *d,
         d->arch.vtsc = 0;
         return;
     }
+    if ( is_pvh_domain(d) && tsc_mode != TSC_MODE_NEVER_EMULATE )
+    {
+        /* PVH fixme: support more tsc modes. */
+        printk(XENLOG_WARNING
+               "PVH currently does not support tsc emulation. Setting timer_mode = native\n");
+        d->arch.vtsc = 0;
+        return;
+    }
 
     switch ( d->arch.tsc_mode = tsc_mode )
     {
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 3f97328..a2073d2 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -721,7 +721,7 @@ __gnttab_map_grant_ref(
 
     double_gt_lock(lgt, rgt);
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
@@ -932,7 +932,7 @@ __gnttab_unmap_common(
             act->pin -= GNTPIN_hstw_inc;
     }
 
-    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+    if ( is_pv_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
         int err = 0;
-- 
1.7.2.3
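In effect, the mtrr.c hunk above reduces the EPT effective-memory-type
computation for a PVH domain to a two-way choice. A sketch of the resulting
Phase-I policy (pvh_epte_emt() is a hypothetical helper, not in the patch):

    /* Sketch of what epte_get_entry_emt() now does for PVH: direct MMIO
     * mappings get UC, everything else gets WB; guest PAT/MTRR state is
     * not consulted yet. */
    static inline uint8_t pvh_epte_emt(bool_t direct_mmio)
    {
        return direct_mmio ? MTRR_TYPE_UNCACHABLE : MTRR_TYPE_WRBACK;
    }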
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 14/18] PVH xen: Checks, asserts, and limitations for PVH
This patch adds some precautionary checks and debug asserts for PVH. Also, PVH doesn't support any HVM-type guest monitoring at present.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/hvm.c      |   13 +++++++++++++
 xen/arch/x86/hvm/mtrr.c     |    3 +++
 xen/arch/x86/physdev.c      |   13 +++++++++++++
 xen/arch/x86/traps.c        |    5 +++++
 xen/arch/x86/x86_64/traps.c |    2 ++
 5 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 118e21a..888e1f8 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4520,8 +4520,11 @@ static int hvm_memory_event_traps(long p, uint32_t reason,
     return 1;
 }
 
+/* PVH fixme: add support for monitoring guest behaviour in below functions. */
 void hvm_memory_event_cr0(unsigned long value, unsigned long old)
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR0],
                            MEM_EVENT_REASON_CR0,
@@ -4530,6 +4533,8 @@ void hvm_memory_event_cr0(unsigned long value, unsigned long old)
 
 void hvm_memory_event_cr3(unsigned long value, unsigned long old)
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR3],
                            MEM_EVENT_REASON_CR3,
@@ -4538,6 +4543,8 @@ void hvm_memory_event_cr3(unsigned long value, unsigned long old)
 
 void hvm_memory_event_cr4(unsigned long value, unsigned long old)
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_CR4],
                            MEM_EVENT_REASON_CR4,
@@ -4546,6 +4553,8 @@ void hvm_memory_event_cr4(unsigned long value, unsigned long old)
 
 void hvm_memory_event_msr(unsigned long msr, unsigned long value)
 {
+    if ( is_pvh_vcpu(current) )
+        return;
     hvm_memory_event_traps(current->domain->arch.hvm_domain
                              .params[HVM_PARAM_MEMORY_EVENT_MSR],
                            MEM_EVENT_REASON_MSR,
@@ -4558,6 +4567,8 @@ int hvm_memory_event_int3(unsigned long gla)
     unsigned long gfn;
     gfn = paging_gva_to_gfn(current, gla, &pfec);
 
+    if ( is_pvh_vcpu(current) )
+        return 0;
     return hvm_memory_event_traps(current->domain->arch.hvm_domain
                                     .params[HVM_PARAM_MEMORY_EVENT_INT3],
                                   MEM_EVENT_REASON_INT3,
@@ -4570,6 +4581,8 @@ int hvm_memory_event_single_step(unsigned long gla)
     unsigned long gfn;
     gfn = paging_gva_to_gfn(current, gla, &pfec);
 
+    if ( is_pvh_vcpu(current) )
+        return 0;
     return hvm_memory_event_traps(current->domain->arch.hvm_domain
                                     .params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP],
                                   MEM_EVENT_REASON_SINGLESTEP,
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index b9d6411..9b377f7 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -578,6 +578,9 @@ int32_t hvm_set_mem_pinned_cacheattr(
 {
     struct hvm_mem_pinned_cacheattr_range *range;
 
+    /* A PVH guest writes to MSR_IA32_CR_PAT natively.
*/ + ASSERT(!is_pvh_domain(d)); + if ( !((type == PAT_TYPE_UNCACHABLE) || (type == PAT_TYPE_WRCOMB) || (type == PAT_TYPE_WRTHROUGH) || diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c index 3733c7a..2fc7ae6 100644 --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -475,6 +475,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_set_iopl: { struct physdev_set_iopl set_iopl; + + if ( is_pvh_vcpu(current) ) + { + ret = -EINVAL; + break; + } + ret = -EFAULT; if ( copy_from_guest(&set_iopl, arg, 1) != 0 ) break; @@ -488,6 +495,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) case PHYSDEVOP_set_iobitmap: { struct physdev_set_iobitmap set_iobitmap; + + if ( is_pvh_vcpu(current) ) + { + ret = -EINVAL; + break; + } ret = -EFAULT; if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 ) break; diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index 0caf73a..6c74e96 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -2709,6 +2709,8 @@ static void emulate_gate_op(struct cpu_user_regs *regs) unsigned long off, eip, opnd_off, base, limit; int jump; + ASSERT(!is_pvh_vcpu(v)); + /* Check whether this fault is due to the use of a call gate. */ if ( !read_gate_descriptor(regs->error_code, v, &sel, &off, &ar) || (((ar >> 13) & 3) < (regs->cs & 3)) || @@ -3325,6 +3327,9 @@ void do_device_not_available(struct cpu_user_regs *regs) BUG_ON(!guest_mode(regs)); + /* PVH should not get here. (ctrlreg is not implemented). */ + ASSERT(!is_pvh_vcpu(curr)); + vcpu_restore_fpu_lazy(curr); if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS ) diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index bcfd740..29dfe95 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -440,6 +440,8 @@ static long register_guest_callback(struct callback_register *reg) long ret = 0; struct vcpu *v = current; + ASSERT(!is_pvh_vcpu(v)); + if ( !is_canonical_address(reg->address) ) return -EINVAL; -- 1.7.2.3
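To illustrate the PHYSDEVOP_set_iopl refusal above, a guest-side sketch of
the PV-style call that a PVH kernel must not rely on (PVH manages IOPL
through EFLAGS natively, as a later patch in this series notes); the
hypercall wrapper name is an assumption:

    /* Sketch: the PV way of raising IOPL.  After this patch a PVH vcpu
     * gets -EINVAL back and should simply not make the call. */
    static int set_guest_iopl(unsigned int iopl)
    {
        struct physdev_set_iopl op = { .iopl = iopl };

        return HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &op);
    }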
This patch expands HVM hcall support to include PVH. Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> --- xen/arch/x86/hvm/hvm.c | 56 ++++++++++++++++++++++++++++++++++++------- xen/arch/x86/x86_64/traps.c | 2 +- 2 files changed, 48 insertions(+), 10 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 888e1f8..3c1597b 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -3188,6 +3188,16 @@ static long hvm_vcpu_op( case VCPUOP_register_vcpu_time_memory_area: rc = do_vcpu_op(cmd, vcpuid, arg); break; + + case VCPUOP_is_up: + case VCPUOP_up: + case VCPUOP_initialise: + if ( is_pvh_vcpu(current) ) + rc = do_vcpu_op(cmd, vcpuid, arg); + else + rc = -ENOSYS; + break; + default: rc = -ENOSYS; break; @@ -3308,12 +3318,31 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = { HYPERCALL(tmem_op) }; +/* PVH 32bitfixme. */ +static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = { + HYPERCALL(platform_op), + HYPERCALL(memory_op), + HYPERCALL(xen_version), + HYPERCALL(console_io), + [ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op, + [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op, + HYPERCALL(mmuext_op), + HYPERCALL(xsm_op), + HYPERCALL(sched_op), + HYPERCALL(event_channel_op), + [ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op, + HYPERCALL(hvm_op), + HYPERCALL(sysctl), + HYPERCALL(domctl) +}; + int hvm_do_hypercall(struct cpu_user_regs *regs) { struct vcpu *curr = current; struct segment_register sreg; int mode = hvm_guest_x86_mode(curr); uint32_t eax = regs->eax; + hvm_hypercall_t **hcall_table; switch ( mode ) { @@ -3334,7 +3363,9 @@ int hvm_do_hypercall(struct cpu_user_regs *regs) if ( (eax & 0x80000000) && is_viridian_domain(curr->domain) ) return viridian_hypercall(regs); - if ( (eax >= NR_hypercalls) || !hvm_hypercall32_table[eax] ) + if ( (eax >= NR_hypercalls) || + (is_pvh_vcpu(curr) && !pvh_hypercall64_table[eax]) || + (is_hvm_vcpu(curr) && !hvm_hypercall32_table[eax]) ) { regs->eax = -ENOSYS; return HVM_HCALL_completed; @@ -3348,17 +3379,24 @@ int hvm_do_hypercall(struct cpu_user_regs *regs) eax, regs->rdi, regs->rsi, regs->rdx, regs->r10, regs->r8, regs->r9); + if ( is_pvh_vcpu(curr) ) + hcall_table = (hvm_hypercall_t **)pvh_hypercall64_table; + else + hcall_table = (hvm_hypercall_t **)hvm_hypercall64_table; + curr->arch.hvm_vcpu.hcall_64bit = 1; - regs->rax = hvm_hypercall64_table[eax](regs->rdi, - regs->rsi, - regs->rdx, - regs->r10, - regs->r8, - regs->r9); + regs->rax = hcall_table[eax](regs->rdi, + regs->rsi, + regs->rdx, + regs->r10, + regs->r8, + regs->r9); curr->arch.hvm_vcpu.hcall_64bit = 0; } else { + ASSERT(!is_pvh_vcpu(curr)); /* PVH 32bitfixme. 
*/ + HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%u(%x, %x, %x, %x, %x, %x)", eax, (uint32_t)regs->ebx, (uint32_t)regs->ecx, (uint32_t)regs->edx, (uint32_t)regs->esi, @@ -3777,7 +3815,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) return -ESRCH; rc = -EINVAL; - if ( !is_hvm_domain(d) ) + if ( is_pv_domain(d) ) goto param_fail; rc = xsm_hvm_param(XSM_TARGET, d, op); @@ -3949,7 +3987,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) break; } - if ( rc == 0 ) + if ( rc == 0 && !is_pvh_domain(d) ) { d->arch.hvm_domain.params[a.index] = a.value; diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index 29dfe95..3f44da8 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -622,7 +622,7 @@ static void hypercall_page_initialise_ring3_kernel(void *hypercall_page) void hypercall_page_initialise(struct domain *d, void *hypercall_page) { memset(hypercall_page, 0xCC, PAGE_SIZE); - if ( is_hvm_domain(d) ) + if ( !is_pv_domain(d) ) hvm_hypercall_page_initialise(d, hypercall_page); else if ( !is_pv_32bit_domain(d) ) hypercall_page_initialise_ring3_kernel(hypercall_page); -- 1.7.2.3
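For reference, a guest-side sketch of the 64-bit ABI that
pvh_hypercall64_table consumes: the hypercall number indexes the hypercall
page (32 bytes per stub) and arguments travel in rdi, rsi, rdx, r10, r8 and
r9, matching the register usage above. hypercall_page is an assumed guest
symbol, and this is a minimal sketch rather than a complete wrapper:

    /* Sketch: two-argument hypercall through the hypercall page. */
    extern char hypercall_page[];

    static inline long hypercall2(unsigned int nr, unsigned long a1,
                                  unsigned long a2)
    {
        long ret;
        register unsigned long _a1 asm("rdi") = a1;
        register unsigned long _a2 asm("rsi") = a2;

        asm volatile ( "call *%[stub]"
                       : "=a" (ret), "+r" (_a1), "+r" (_a2)
                       : [stub] "r" (hypercall_page + nr * 32)
                       : "memory" );
        return ret;
    }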
This patch contains the VMCS changes needed for PVH, mainly creating a VMCS for a PVH guest.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/vmx/vmcs.c |  254 ++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 43539a6..f21571c 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -634,7 +634,7 @@ void vmx_vmcs_exit(struct vcpu *v)
     {
         /* Don't confuse vmx_do_resume (for @v or @current!) */
         vmx_clear_vmcs(v);
-        if ( is_hvm_vcpu(current) )
+        if ( !is_pv_vcpu(current) )
             vmx_load_vmcs(current);
 
         spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
@@ -856,6 +856,239 @@ static void vmx_set_common_host_vmcs_fields(struct vcpu *v)
     __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
 }
 
+static int pvh_check_requirements(struct vcpu *v)
+{
+    u64 required, tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+
+    if ( !paging_mode_hap(v->domain) )
+    {
+        printk(XENLOG_G_INFO "HAP is required for PVH guest.\n");
+        return -EINVAL;
+    }
+    if ( !cpu_has_vmx_pat )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have PAT support\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_msr_bitmap )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU does not have msr bitmap\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_vpid )
+    {
+        printk(XENLOG_G_INFO "PVH: CPU doesn't have VPID support\n");
+        return -ENOSYS;
+    }
+    if ( !cpu_has_vmx_secondary_exec_control )
+    {
+        printk(XENLOG_G_INFO "CPU Secondary exec is required to run PVH\n");
+        return -ENOSYS;
+    }
+
+    if ( v->domain->arch.vtsc )
+    {
+        printk(XENLOG_G_INFO
+               "At present PVH only supports the default timer mode\n");
+        return -ENOSYS;
+    }
+
+    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
+    if ( (tmpval & required) != required )
+    {
+        printk(XENLOG_G_INFO "PVH: required CR4 features not available:%lx\n",
+               required);
+        return -ENOSYS;
+    }
+
+    return 0;
+}
+
+static int pvh_construct_vmcs(struct vcpu *v)
+{
+    int rc, msr_type;
+    unsigned long *msr_bitmap;
+    struct domain *d = v->domain;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    struct ept_data *ept = &p2m->ept;
+    u32 vmexit_ctl = vmx_vmexit_control;
+    u32 vmentry_ctl = vmx_vmentry_control;
+    u64 host_pat, tmpval = -1;
+
+    if ( (rc = pvh_check_requirements(v)) )
+        return rc;
+
+    msr_bitmap = alloc_xenheap_page();
+    if ( msr_bitmap == NULL )
+        return -ENOMEM;
+
+    /* 1. Pin-Based Controls: */
+    __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control);
+
+    v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
+
+    /* 2. Primary Processor-based controls: */
+    /*
+     * If rdtsc exiting is turned on and it goes thru emulate_privileged_op,
+     * then pv_vcpu.ctrlreg must be added to the pvh struct.
+     */
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
+
+    v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
+                                      CPU_BASED_CR3_LOAD_EXITING |
+                                      CPU_BASED_CR3_STORE_EXITING);
+    v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
+    v->arch.hvm_vmx.exec_control |= CPU_BASED_ACTIVATE_MSR_BITMAP;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_TPR_SHADOW;
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+
+    /* 3. Secondary Processor-based controls (Intel SDM: resvd bits are 0): */
+    v->arch.hvm_vmx.secondary_exec_control = SECONDARY_EXEC_ENABLE_EPT;
+    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_ENABLE_VPID;
+    v->arch.hvm_vmx.secondary_exec_control |= SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+
+    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+              v->arch.hvm_vmx.secondary_exec_control);
+
+    __vmwrite(IO_BITMAP_A, virt_to_maddr((char *)hvm_io_bitmap + 0));
+    __vmwrite(IO_BITMAP_B, virt_to_maddr((char *)hvm_io_bitmap + PAGE_SIZE));
+
+    /* MSR bitmap for intercepts. */
+    memset(msr_bitmap, ~0, PAGE_SIZE);
+    v->arch.hvm_vmx.msr_bitmap = msr_bitmap;
+    __vmwrite(MSR_BITMAP, virt_to_maddr(msr_bitmap));
+
+    msr_type = MSR_TYPE_R | MSR_TYPE_W;
+    /* Disable intercepts for MSRs that have corresponding VMCS fields. */
+    vmx_disable_intercept_for_msr(v, MSR_FS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_GS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE, msr_type);
+    vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT, msr_type);
+
+    /*
+     * We don't disable intercepts for MSRs: MSR_STAR, MSR_LSTAR, MSR_CSTAR,
+     * and MSR_SYSCALL_MASK because we need to specify save/restore area to
+     * save/restore at every VM exit and entry. Instead, let the intercept
+     * functions save them into vmx_msr_state fields. See comment in
+     * vmx_restore_host_msrs(). See also vmx_restore_guest_msrs().
+     */
+    __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
+    __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
+    __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
+
+    __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
+
+    /*
+     * Note: we run with default VM_ENTRY_LOAD_DEBUG_CTLS of 1, which means
+     * upon vmentry the cpu reads/loads VMCS.DR7 and VMCS.DEBUGCTLS and does
+     * not use the host values. 0 would cause it to not use the VMCS values.
+     */
+    vmentry_ctl &= ~VM_ENTRY_LOAD_GUEST_EFER;
+    vmentry_ctl &= ~VM_ENTRY_SMM;
+    vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
+    /* PVH 32bitfixme. */
+    vmentry_ctl |= VM_ENTRY_IA32E_MODE;   /* GUEST_EFER.LME/LMA ignored */
+
+    __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
+
+    vmx_set_common_host_vmcs_fields(v);
+
+    __vmwrite(VM_ENTRY_INTR_INFO, 0);
+    __vmwrite(CR3_TARGET_COUNT, 0);
+    __vmwrite(GUEST_ACTIVITY_STATE, 0);
+
+    /* These are sorta irrelevant as we load the descriptors directly. */
+    __vmwrite(GUEST_CS_SELECTOR, 0);
+    __vmwrite(GUEST_DS_SELECTOR, 0);
+    __vmwrite(GUEST_SS_SELECTOR, 0);
+    __vmwrite(GUEST_ES_SELECTOR, 0);
+    __vmwrite(GUEST_FS_SELECTOR, 0);
+    __vmwrite(GUEST_GS_SELECTOR, 0);
+
+    __vmwrite(GUEST_CS_BASE, 0);
+    __vmwrite(GUEST_CS_LIMIT, ~0u);
+    /* CS.L == 1, exec, read/write, accessed. PVH 32bitfixme.
*/ + __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); + + __vmwrite(GUEST_DS_BASE, 0); + __vmwrite(GUEST_DS_LIMIT, ~0u); + __vmwrite(GUEST_DS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_SS_BASE, 0); + __vmwrite(GUEST_SS_LIMIT, ~0u); + __vmwrite(GUEST_SS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_ES_BASE, 0); + __vmwrite(GUEST_ES_LIMIT, ~0u); + __vmwrite(GUEST_ES_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_FS_BASE, 0); + __vmwrite(GUEST_FS_LIMIT, ~0u); + __vmwrite(GUEST_FS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_GS_BASE, 0); + __vmwrite(GUEST_GS_LIMIT, ~0u); + __vmwrite(GUEST_GS_AR_BYTES, 0xc093); /* read/write, accessed */ + + __vmwrite(GUEST_GDTR_BASE, 0); + __vmwrite(GUEST_GDTR_LIMIT, 0); + + __vmwrite(GUEST_LDTR_BASE, 0); + __vmwrite(GUEST_LDTR_LIMIT, 0); + __vmwrite(GUEST_LDTR_AR_BYTES, 0x82); /* LDT */ + __vmwrite(GUEST_LDTR_SELECTOR, 0); + + /* Guest TSS. */ + __vmwrite(GUEST_TR_BASE, 0); + __vmwrite(GUEST_TR_LIMIT, 0xff); + __vmwrite(GUEST_TR_AR_BYTES, 0x8b); /* 32-bit TSS (busy) */ + + __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); + __vmwrite(GUEST_DR7, 0); + __vmwrite(VMCS_LINK_POINTER, ~0UL); + + __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); + __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0); + + v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK | (1U << TRAP_debug) | + (1U << TRAP_int3) | (1U << TRAP_no_device); + __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap); + + /* Set WP bit so rdonly pages are not written from CPL 0. */ + tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP; + __vmwrite(GUEST_CR0, tmpval); + __vmwrite(CR0_READ_SHADOW, tmpval); + v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] = tmpval; + + tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features); + __vmwrite(GUEST_CR4, tmpval); + __vmwrite(CR4_READ_SHADOW, tmpval); + v->arch.hvm_vcpu.guest_cr[4] = tmpval; + + __vmwrite(CR0_GUEST_HOST_MASK, ~0UL); + __vmwrite(CR4_GUEST_HOST_MASK, ~0UL); + + v->arch.hvm_vmx.vmx_realmode = 0; + + ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m)); + __vmwrite(EPT_POINTER, ept_get_eptp(ept)); + + rdmsrl(MSR_IA32_CR_PAT, host_pat); + __vmwrite(HOST_PAT, host_pat); + __vmwrite(GUEST_PAT, MSR_IA32_CR_PAT_RESET); + + /* The paging mode is updated for PVH by arch_set_info_guest(). */ + + return 0; +} + static int construct_vmcs(struct vcpu *v) { struct domain *d = v->domain; @@ -864,6 +1097,13 @@ static int construct_vmcs(struct vcpu *v) vmx_vmcs_enter(v); + if ( is_pvh_vcpu(v) ) + { + int rc = pvh_construct_vmcs(v); + vmx_vmcs_exit(v); + return rc; + } + /* VMCS controls. 
*/ __vmwrite(PIN_BASED_VM_EXEC_CONTROL, vmx_pin_based_exec_control); @@ -1281,8 +1521,11 @@ void vmx_do_resume(struct vcpu *v) vmx_clear_vmcs(v); vmx_load_vmcs(v); - hvm_migrate_timers(v); - hvm_migrate_pirqs(v); + if ( !is_pvh_vcpu(v) ) + { + hvm_migrate_timers(v); + hvm_migrate_pirqs(v); + } vmx_set_host_env(v); /* * Both n1 VMCS and n2 VMCS need to update the host environment after @@ -1294,6 +1537,9 @@ void vmx_do_resume(struct vcpu *v) hvm_asid_flush_vcpu(v); } + if ( is_pvh_vcpu(v) ) + reset_stack_and_jump(vmx_asm_do_vmentry); + debug_state = v->domain->debugger_attached || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_INT3] || v->domain->arch.hvm_domain.params[HVM_PARAM_MEMORY_EVENT_SINGLE_STEP]; @@ -1477,7 +1723,7 @@ static void vmcs_dump(unsigned char ch) for_each_domain ( d ) { - if ( !is_hvm_domain(d) ) + if ( is_pv_domain(d) ) continue; printk("\n>>> Domain %d <<<\n", d->domain_id); for_each_vcpu ( d, v ) -- 1.7.2.3
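As a cross-check on the access-rights constants used above, a sketch
employing a hypothetical AR() helper macro (field layout per the VMCS
segment AR-bytes format in the Intel SDM); the BUILD_BUG_ON checks would
sit inside some function:

    /* Sketch: build AR bytes from fields: type, S, DPL, P, AVL, L, D/B, G. */
    #define AR(type, s, dpl, p, avl, l, db, g)                        \
        (((type) << 0) | ((s) << 4) | ((dpl) << 5) | ((p) << 7) |     \
         ((avl) << 12) | ((l) << 13) | ((db) << 14) | ((g) << 15))

    BUILD_BUG_ON(AR(0xb, 1, 0, 1, 0, 1, 0, 1) != 0xa09b); /* 64-bit code */
    BUILD_BUG_ON(AR(0x3, 1, 0, 1, 0, 0, 1, 1) != 0xc093); /* flat data   */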
Mukesh Rathor
2013-Jun-25 00:01 UTC
[PATCH 17/18] PVH xen: HVM support of PVH guest creation/destruction
This patch implements the HVM/VMX portion of guest creation, i.e., vcpu and domain initialization. It also contains some changes to support the destroy path.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/hvm.c     |   67 ++++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/hvm/vmx/vmx.c |   40 ++++++++++++++++++++++++++
 2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 3c1597b..2988a5f 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -510,6 +510,30 @@ static int hvm_print_line(
     return X86EMUL_OKAY;
 }
 
+static int pvh_dom_initialise(struct domain *d)
+{
+    int rc;
+
+    if ( !d->arch.hvm_domain.hap_enabled )
+        return -EINVAL;
+
+    spin_lock_init(&d->arch.hvm_domain.irq_lock);
+
+    hvm_init_cacheattr_region_list(d);
+
+    if ( (rc = paging_enable(d, PG_refcounts|PG_translate|PG_external)) != 0 )
+        goto pvh_dominit_fail;
+
+    if ( (rc = hvm_funcs.domain_initialise(d)) != 0 )
+        goto pvh_dominit_fail;
+
+    return 0;
+
+pvh_dominit_fail:
+    hvm_destroy_cacheattr_region_list(d);
+    return rc;
+}
+
 int hvm_domain_initialise(struct domain *d)
 {
     int rc;
@@ -520,6 +544,8 @@ int hvm_domain_initialise(struct domain *d)
                  "on a non-VT/AMDV platform.\n");
         return -EINVAL;
     }
+    if ( is_pvh_domain(d) )
+        return pvh_dom_initialise(d);
 
     spin_lock_init(&d->arch.hvm_domain.pbuf_lock);
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
@@ -584,6 +610,9 @@ int hvm_domain_initialise(struct domain *d)
 
 void hvm_domain_relinquish_resources(struct domain *d)
 {
+    if ( is_pvh_domain(d) )
+        return;
+
     if ( hvm_funcs.nhvm_domain_relinquish_resources )
         hvm_funcs.nhvm_domain_relinquish_resources(d);
 
@@ -609,10 +638,14 @@ void hvm_domain_relinquish_resources(struct domain *d)
 void hvm_domain_destroy(struct domain *d)
 {
     hvm_funcs.domain_destroy(d);
+    hvm_destroy_cacheattr_region_list(d);
+
+    if ( is_pvh_domain(d) )
+        return;
+
     rtc_deinit(d);
     stdvga_deinit(d);
     vioapic_deinit(d);
-    hvm_destroy_cacheattr_region_list(d);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
@@ -1066,6 +1099,30 @@ static int __init __hvm_register_CPU_XSAVE_save_and_restore(void)
 }
 __initcall(__hvm_register_CPU_XSAVE_save_and_restore);
 
+static int pvh_vcpu_initialise(struct vcpu *v)
+{
+    int rc;
+
+    if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
+        return rc;
+
+    softirq_tasklet_init(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
+                         (void(*)(unsigned long))hvm_assert_evtchn_irq,
+                         (unsigned long)v);
+
+    v->arch.hvm_vcpu.hcall_64bit = 1;    /* PVH 32bitfixme. */
+    v->arch.user_regs.eflags = 2;
+    v->arch.hvm_vcpu.inject_trap.vector = -1;
+
+    if ( (rc = hvm_vcpu_cacheattr_init(v)) != 0 )
+    {
+        hvm_funcs.vcpu_destroy(v);
+        return rc;
+    }
+
+    return 0;
+}
+
 int hvm_vcpu_initialise(struct vcpu *v)
 {
     int rc;
@@ -1077,6 +1134,9 @@ int hvm_vcpu_initialise(struct vcpu *v)
     spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
     INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
 
+    if ( is_pvh_vcpu(v) )
+        return pvh_vcpu_initialise(v);
+
     if ( (rc = vlapic_init(v)) != 0 )
         goto fail1;
 
@@ -1165,7 +1225,10 @@ void hvm_vcpu_destroy(struct vcpu *v)
 
     tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
     hvm_vcpu_cacheattr_destroy(v);
-    vlapic_destroy(v);
+
+    if ( !is_pvh_vcpu(v) )
+        vlapic_destroy(v);
+
     hvm_funcs.vcpu_destroy(v);
 
     /* Event channel is already freed by evtchn_destroy().
*/ diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 62cb84d..cb82523 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -82,6 +82,9 @@ static int vmx_domain_initialise(struct domain *d) { int rc; + if ( is_pvh_domain(d) ) + return 0; + if ( (rc = vmx_alloc_vlapic_mapping(d)) != 0 ) return rc; @@ -90,6 +93,9 @@ static int vmx_domain_initialise(struct domain *d) static void vmx_domain_destroy(struct domain *d) { + if ( is_pvh_domain(d) ) + return; + vmx_free_vlapic_mapping(d); } @@ -113,6 +119,12 @@ static int vmx_vcpu_initialise(struct vcpu *v) vpmu_initialise(v); + if ( is_pvh_vcpu(v) ) + { + /* This for hvm_long_mode_enabled(v). */ + v->arch.hvm_vcpu.guest_efer = EFER_SCE | EFER_LMA | EFER_LME; + return 0; + } vmx_install_vlapic_mapping(v); /* %eax == 1 signals full real-mode support to the guest loader. */ @@ -1034,6 +1046,28 @@ static void vmx_update_host_cr3(struct vcpu *v) vmx_vmcs_exit(v); } +/* + * PVH guest never causes CR3 write vmexit. This is called during the guest + * setup. + */ +static void vmx_update_pvh_cr(struct vcpu *v, unsigned int cr) +{ + vmx_vmcs_enter(v); + switch ( cr ) + { + case 3: + __vmwrite(GUEST_CR3, v->arch.hvm_vcpu.guest_cr[3]); + hvm_asid_flush_vcpu(v); + break; + + default: + printk(XENLOG_ERR + "PVH: d%d v%d unexpected cr%d update at rip:%lx\n", + v->domain->domain_id, v->vcpu_id, cr, __vmread(GUEST_RIP)); + } + vmx_vmcs_exit(v); +} + void vmx_update_debug_state(struct vcpu *v) { unsigned long mask; @@ -1053,6 +1087,12 @@ void vmx_update_debug_state(struct vcpu *v) static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr) { + if ( is_pvh_vcpu(v) ) + { + vmx_update_pvh_cr(v, cr); + return; + } + vmx_vmcs_enter(v); switch ( cr ) -- 1.7.2.3
The heart of this patch is the vmx exit handler for PVH guests. It is nicely isolated in a separate module. A call to it is added to vmx_vmexit_handler().

Changes in V2:
  - Move non VMX generic code to arch/x86/hvm/pvh.c
  - Remove get_gpr_ptr() and use existing decode_register() instead.
  - Defer call to pvh vmx exit handler until interrupts are enabled. So the
    caller, vmx_vmexit_handler(), handles the NMI/EXT-INT/TRIPLE_FAULT now.
  - Fix the CPUID (wrongly) clearing bit 24. No need to do this now, set
    the correct feature bits in CR4 during vmcs creation.
  - Fix few hard tabs.

Changes in V3:
  - Lot of cleanup and rework in PVH vm exit handler.
  - add parameter to emulate_forced_invalid_op().

Changes in V5:
  - Move pvh.c and emulate_forced_invalid_op related changes to another patch.
  - Formatting.
  - Remove vmx_pvh_read_descriptor().
  - Use SS DPL instead of CS.RPL for CPL.
  - Remove pvh_user_cpuid() and call pv_cpuid for user mode also.

Changes in V6:
  - Replace domain_crash_synchronous() with domain_crash().

Changes in V7:
  - Don't read all selectors on every vmexit. Do that only for the IO
    instruction vmexit.
  - Add couple checks and set guest_cr[4] in access_cr4().
  - Add period after all comments in case that's an issue.
  - Move making pv_cpuid and emulate_privileged_op public here.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
---
 xen/arch/x86/hvm/vmx/Makefile     |    1 +
 xen/arch/x86/hvm/vmx/vmx.c        |    7 +
 xen/arch/x86/hvm/vmx/vmx_pvh.c    |  523 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/traps.c              |    4 +-
 xen/include/asm-x86/hvm/vmx/vmx.h |    2 +
 xen/include/asm-x86/processor.h   |    1 +
 xen/include/asm-x86/traps.h       |    2 +
 7 files changed, 538 insertions(+), 2 deletions(-)
 create mode 100644 xen/arch/x86/hvm/vmx/vmx_pvh.c

diff --git a/xen/arch/x86/hvm/vmx/Makefile b/xen/arch/x86/hvm/vmx/Makefile
index 373b3d9..8b71dae 100644
--- a/xen/arch/x86/hvm/vmx/Makefile
+++ b/xen/arch/x86/hvm/vmx/Makefile
@@ -5,3 +5,4 @@ obj-y += vmcs.o
 obj-y += vmx.o
 obj-y += vpmu_core2.o
 obj-y += vvmx.o
+obj-y += vmx_pvh.o
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index cb82523..fb69219 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1595,6 +1595,7 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .deliver_posted_intr  = vmx_deliver_posted_intr,
     .sync_pir_to_irr      = vmx_sync_pir_to_irr,
     .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m,
+    .pvh_set_vcpu_info    = vmx_pvh_set_vcpu_info,
 };
 
 const struct hvm_function_table * __init start_vmx(void)
@@ -2447,6 +2448,12 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
 
+    if ( is_pvh_vcpu(v) )
+    {
+        vmx_pvh_vmexit_handler(regs);
+        return;
+    }
+
     if ( v->arch.hvm_vmx.vmx_realmode )
     {
         /* Put RFLAGS back the way the guest wants it */
diff --git a/xen/arch/x86/hvm/vmx/vmx_pvh.c b/xen/arch/x86/hvm/vmx/vmx_pvh.c
new file mode 100644
index 0000000..6ece221
--- /dev/null
+++ b/xen/arch/x86/hvm/vmx/vmx_pvh.c
@@ -0,0 +1,523 @@
+/*
+ * Copyright (C) 2013, Mukesh Rathor, Oracle Corp.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU + * General Public License for more details. + */ + +#include <xen/hypercall.h> +#include <xen/guest_access.h> +#include <asm/p2m.h> +#include <asm/traps.h> +#include <asm/hvm/vmx/vmx.h> +#include <public/sched.h> +#include <asm/hvm/nestedhvm.h> +#include <asm/xstate.h> + +#ifndef NDEBUG +int pvhdbg = 0; +#define dbgp1(...) do { (pvhdbg == 1) ? printk(__VA_ARGS__) : 0; } while ( 0 ) +#else +#define dbgp1(...) ((void)0) +#endif + + +/* NOTE: this does NOT read the CS. */ +static void read_vmcs_selectors(struct cpu_user_regs *regs) +{ + regs->ss = __vmread(GUEST_SS_SELECTOR); + regs->ds = __vmread(GUEST_DS_SELECTOR); + regs->es = __vmread(GUEST_ES_SELECTOR); + regs->gs = __vmread(GUEST_GS_SELECTOR); + regs->fs = __vmread(GUEST_FS_SELECTOR); +} + +/* Returns : 0 == msr read successfully. */ +static int vmxit_msr_read(struct cpu_user_regs *regs) +{ + u64 msr_content = 0; + + switch ( regs->ecx ) + { + case MSR_IA32_MISC_ENABLE: + rdmsrl(MSR_IA32_MISC_ENABLE, msr_content); + msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL | + MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL; + break; + + default: + /* PVH fixme: see hvm_msr_read_intercept(). */ + rdmsrl(regs->ecx, msr_content); + break; + } + regs->eax = (uint32_t)msr_content; + regs->edx = (uint32_t)(msr_content >> 32); + vmx_update_guest_eip(); + + dbgp1("msr read c:%lx a:%lx d:%lx RIP:%lx RSP:%lx\n", regs->ecx, regs->eax, + regs->edx, regs->rip, regs->rsp); + + return 0; +} + +/* Returns : 0 == msr written successfully. */ +static int vmxit_msr_write(struct cpu_user_regs *regs) +{ + uint64_t msr_content = (uint32_t)regs->eax | ((uint64_t)regs->edx << 32); + + dbgp1("PVH: msr write:0x%lx. eax:0x%lx edx:0x%lx\n", regs->ecx, + regs->eax, regs->edx); + + if ( hvm_msr_write_intercept(regs->ecx, msr_content) == X86EMUL_OKAY ) + { + vmx_update_guest_eip(); + return 0; + } + return 1; +} + +static int vmxit_debug(struct cpu_user_regs *regs) +{ + struct vcpu *vp = current; + unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION); + + write_debugreg(6, exit_qualification | 0xffff0ff0); + + /* gdbsx or another debugger. */ + if ( vp->domain->domain_id != 0 && /* never pause dom0 */ + guest_kernel_mode(vp, regs) && vp->domain->debugger_attached ) + + domain_pause_for_debugger(); + else + hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE); + + return 0; +} + +/* Returns: rc == 0: handled the MTF vmexit. */ +static int vmxit_mtf(struct cpu_user_regs *regs) +{ + struct vcpu *vp = current; + int rc = -EINVAL, ss = vp->arch.hvm_vcpu.single_step; + + vp->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, vp->arch.hvm_vmx.exec_control); + vp->arch.hvm_vcpu.single_step = 0; + + if ( vp->domain->debugger_attached && ss ) + { + domain_pause_for_debugger(); + rc = 0; + } + return rc; +} + +static int vmxit_int3(struct cpu_user_regs *regs) +{ + int ilen = vmx_get_instruction_length(); + struct vcpu *vp = current; + struct hvm_trap trap_info = { + .vector = TRAP_int3, + .type = X86_EVENTTYPE_SW_EXCEPTION, + .error_code = HVM_DELIVER_NO_ERROR_CODE, + .insn_len = ilen + }; + + /* gdbsx or another debugger. Never pause dom0. 
*/
+    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
+    {
+        regs->eip += ilen;
+        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
+        current->arch.gdbsx_vcpu_event = TRAP_int3;
+        domain_pause_for_debugger();
+        return 0;
+    }
+    hvm_inject_trap(&trap_info);
+
+    return 0;
+}
+
+static int vmxit_invalid_op(struct cpu_user_regs *regs)
+{
+    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
+        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
+
+    return 0;
+}
+
+/* Returns: rc == 0: handled the exception. */
+static int vmxit_exception(struct cpu_user_regs *regs)
+{
+    int vector = (__vmread(VM_EXIT_INTR_INFO)) & INTR_INFO_VECTOR_MASK;
+    int rc = -ENOSYS;
+
+    dbgp1(" EXCPT: vec:%d cs:%lx r.IP:%lx\n", vector,
+          __vmread(GUEST_CS_SELECTOR), regs->eip);
+
+    switch ( vector )
+    {
+    case TRAP_debug:
+        rc = vmxit_debug(regs);
+        break;
+
+    case TRAP_int3:
+        rc = vmxit_int3(regs);
+        break;
+
+    case TRAP_invalid_op:
+        rc = vmxit_invalid_op(regs);
+        break;
+
+    case TRAP_no_device:
+        hvm_funcs.fpu_dirty_intercept();
+        rc = 0;
+        break;
+
+    default:
+        gdprintk(XENLOG_G_WARNING,
+                 "PVH: Unhandled trap:%d. IP:%lx\n", vector, regs->eip);
+    }
+    return rc;
+}
+
+static int vmxit_vmcall(struct cpu_user_regs *regs)
+{
+    if ( hvm_do_hypercall(regs) != HVM_HCALL_preempted )
+        vmx_update_guest_eip();
+    return 0;
+}
+
+/* Returns: rc == 0: success. */
+static int access_cr0(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
+{
+    struct vcpu *vp = current;
+
+    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
+    {
+        unsigned long new_cr0 = *regp;
+        unsigned long old_cr0 = __vmread(GUEST_CR0);
+
+        dbgp1("PVH: writing to CR0. RIP:%lx val:0x%lx\n", regs->rip, *regp);
+        if ( (u32)new_cr0 != new_cr0 )
+        {
+            gdprintk(XENLOG_G_WARNING,
+                     "Guest setting upper 32 bits in CR0: %lx", new_cr0);
+            return -EPERM;
+        }
+
+        new_cr0 &= ~HVM_CR0_GUEST_RESERVED_BITS;
+        /* ET is reserved and should always be 1. */
+        new_cr0 |= X86_CR0_ET;
+
+        /* A PVH guest is not expected to change to real mode. */
+        if ( (new_cr0 & (X86_CR0_PE | X86_CR0_PG)) !=
+             (X86_CR0_PG | X86_CR0_PE) )
+        {
+            gdprintk(XENLOG_G_WARNING,
+                     "PVH attempting to turn off PE/PG. CR0:%lx\n", new_cr0);
+            return -EPERM;
+        }
+        /* TS going from 1 to 0 */
+        if ( (old_cr0 & X86_CR0_TS) && ((new_cr0 & X86_CR0_TS) == 0) )
+            vmx_fpu_enter(vp);
+
+        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = new_cr0;
+        __vmwrite(GUEST_CR0, new_cr0);
+        __vmwrite(CR0_READ_SHADOW, new_cr0);
+    }
+    else
+        *regp = __vmread(GUEST_CR0);
+
+    return 0;
+}
+
+/* Returns: rc == 0: success. */
+static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
+{
+    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
+    {
+        struct vcpu *vp = current;
+        u64 old_val = __vmread(GUEST_CR4);
+        u64 new = *regp;
+
+        if ( new & HVM_CR4_GUEST_RESERVED_BITS(vp) )
+        {
+            gdprintk(XENLOG_G_WARNING,
+                     "PVH guest attempts to set reserved bit in CR4: %lx",
+                     new);
+            hvm_inject_hw_exception(TRAP_gp_fault, 0);
+            return 0;
+        }
+
+        if ( !(new & X86_CR4_PAE) && hvm_long_mode_enabled(vp) )
+        {
+            gdprintk(XENLOG_G_WARNING, "Guest cleared CR4.PAE while "
+                     "EFER.LMA is set");
+            hvm_inject_hw_exception(TRAP_gp_fault, 0);
+            return 0;
+        }
+
+        vp->arch.hvm_vcpu.guest_cr[4] = new;
+
+        if ( (old_val ^ new) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
+            vpid_sync_all();
+
+        __vmwrite(CR4_READ_SHADOW, new);
+
+        new &= ~X86_CR4_PAE;  /* PVH always runs with hap enabled. */
+        new |= X86_CR4_VMXE | X86_CR4_MCE;
+        __vmwrite(GUEST_CR4, new);
+    }
+    else
+        *regp = __vmread(CR4_READ_SHADOW);
+
+    return 0;
+}
+
+/* Returns: rc == 0: success, else -errno. */
+static int vmxit_cr_access(struct cpu_user_regs *regs)
+{
+    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
+    uint acc_typ = VMX_CONTROL_REG_ACCESS_TYPE(exit_qualification);
+    int cr, rc = -EINVAL;
+
+    switch ( acc_typ )
+    {
+    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR:
+    case VMX_CONTROL_REG_ACCESS_TYPE_MOV_FROM_CR:
+    {
+        uint gpr = VMX_CONTROL_REG_ACCESS_GPR(exit_qualification);
+        uint64_t *regp = decode_register(gpr, regs, 0);
+        cr = VMX_CONTROL_REG_ACCESS_NUM(exit_qualification);
+
+        if ( regp == NULL )
+            break;
+
+        switch ( cr )
+        {
+        case 0:
+            rc = access_cr0(regs, acc_typ, regp);
+            break;
+
+        case 3:
+            gdprintk(XENLOG_G_ERR, "PVH: unexpected cr3 vmexit. rip:%lx\n",
+                     regs->rip);
+            domain_crash(current->domain);
+            break;
+
+        case 4:
+            rc = access_cr4(regs, acc_typ, regp);
+            break;
+        }
+        if ( rc == 0 )
+            vmx_update_guest_eip();
+        break;
+    }
+
+    case VMX_CONTROL_REG_ACCESS_TYPE_CLTS:
+    {
+        struct vcpu *vp = current;
+        unsigned long cr0 = vp->arch.hvm_vcpu.guest_cr[0] & ~X86_CR0_TS;
+        vp->arch.hvm_vcpu.hw_cr[0] = vp->arch.hvm_vcpu.guest_cr[0] = cr0;
+
+        vmx_fpu_enter(vp);
+        __vmwrite(GUEST_CR0, cr0);
+        __vmwrite(CR0_READ_SHADOW, cr0);
+        vmx_update_guest_eip();
+        rc = 0;
+    }
+    }
+    return rc;
+}
+
+/*
+ * NOTE: A PVH guest sets IOPL natively by setting bits in the eflags, and not
+ *       via hypercalls used by a PV.
+ */
+static int vmxit_io_instr(struct cpu_user_regs *regs)
+{
+    struct segment_register seg;
+    int requested = (regs->rflags & X86_EFLAGS_IOPL) >> 12;
+    int curr_lvl = (regs->rflags & X86_EFLAGS_VM) ? 3 : 0;
+
+    if ( curr_lvl == 0 )
+    {
+        hvm_get_segment_register(current, x86_seg_ss, &seg);
+        curr_lvl = seg.attr.fields.dpl;
+    }
+    read_vmcs_selectors(regs);
+    if ( requested >= curr_lvl && emulate_privileged_op(regs) )
+        return 0;
+
+    hvm_inject_hw_exception(TRAP_gp_fault, regs->error_code);
+    return 0;
+}
+
+static int pvh_ept_handle_violation(unsigned long qualification,
+                                    paddr_t gpa, struct cpu_user_regs *regs)
+{
+    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
+    p2m_type_t p2mt;
+    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
+
+    gdprintk(XENLOG_G_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "
+             "gpa %#"PRIpaddr", mfn %#lx, type %i. IP:0x%lx RSP:0x%lx\n",
+             qualification,
+             (qualification & EPT_READ_VIOLATION) ? 'r' : '-',
+             (qualification & EPT_WRITE_VIOLATION) ? 'w' : '-',
+             (qualification & EPT_EXEC_VIOLATION) ? 'x' : '-',
+             (qualification & EPT_EFFECTIVE_READ) ? 'r' : '-',
+             (qualification & EPT_EFFECTIVE_WRITE) ? 'w' : '-',
+             (qualification & EPT_EFFECTIVE_EXEC) ? 'x' : '-',
+             gpa, mfn_x(mfn), p2mt, regs->rip, regs->rsp);
+
+    ept_walk_table(current->domain, gfn);
+
+    if ( qualification & EPT_GLA_VALID )
+    {
+        gla = __vmread(GUEST_LINEAR_ADDRESS);
+        gdprintk(XENLOG_G_ERR, " --- GLA %#lx\n", gla);
+    }
+    hvm_inject_hw_exception(TRAP_gp_fault, 0);
+    return 0;
+}
+
+/*
+ * Main vm exit handler for PVH. Called from vmx_vmexit_handler().
+ * Note: vmx_asm_vmexit_handler updates rip/rsp/eflags in regs{} struct.
+ */ +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs) +{ + unsigned long exit_qualification; + unsigned int exit_reason = __vmread(VM_EXIT_REASON); + int rc=0, ccpu = smp_processor_id(); + struct vcpu *v = current; + + dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx EFLAGS:%lx CR0:%lx\n", + ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags, + __vmread(GUEST_CR0)); + + /* For guest_kernel_mode which is called from most places below. */ + regs->cs = __vmread(GUEST_CS_SELECTOR); + + switch ( (uint16_t)exit_reason ) + { + /* NMI and machine_check are handled by the caller, we handle rest here */ + case EXIT_REASON_EXCEPTION_NMI: /* 0 */ + rc = vmxit_exception(regs); + break; + + case EXIT_REASON_EXTERNAL_INTERRUPT: /* 1 */ + break; /* handled in vmx_vmexit_handler() */ + + case EXIT_REASON_PENDING_VIRT_INTR: /* 7 */ + /* Disable the interrupt window. */ + v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING; + __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control); + break; + + case EXIT_REASON_CPUID: /* 10 */ + pv_cpuid(regs); + vmx_update_guest_eip(); + break; + + case EXIT_REASON_HLT: /* 12 */ + vmx_update_guest_eip(); + hvm_hlt(regs->eflags); + break; + + case EXIT_REASON_VMCALL: /* 18 */ + rc = vmxit_vmcall(regs); + break; + + case EXIT_REASON_CR_ACCESS: /* 28 */ + rc = vmxit_cr_access(regs); + break; + + case EXIT_REASON_DR_ACCESS: /* 29 */ + exit_qualification = __vmread(EXIT_QUALIFICATION); + vmx_dr_access(exit_qualification, regs); + break; + + case EXIT_REASON_IO_INSTRUCTION: /* 30 */ + vmxit_io_instr(regs); + break; + + case EXIT_REASON_MSR_READ: /* 31 */ + rc = vmxit_msr_read(regs); + break; + + case EXIT_REASON_MSR_WRITE: /* 32 */ + rc = vmxit_msr_write(regs); + break; + + case EXIT_REASON_MONITOR_TRAP_FLAG: /* 37 */ + rc = vmxit_mtf(regs); + break; + + case EXIT_REASON_MCE_DURING_VMENTRY: /* 41 */ + break; /* handled in vmx_vmexit_handler() */ + + case EXIT_REASON_EPT_VIOLATION: /* 48 */ + { + paddr_t gpa = __vmread(GUEST_PHYSICAL_ADDRESS); + exit_qualification = __vmread(EXIT_QUALIFICATION); + rc = pvh_ept_handle_violation(exit_qualification, gpa, regs); + break; + } + + default: + rc = 1; + gdprintk(XENLOG_G_ERR, + "PVH: Unexpected exit reason:0x%x\n", exit_reason); + } + + if ( rc ) + { + exit_qualification = __vmread(EXIT_QUALIFICATION); + gdprintk(XENLOG_G_WARNING, + "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n", + ccpu, exit_reason, exit_reason, exit_qualification, + exit_qualification, __vmread(GUEST_CR0)); + gdprintk(XENLOG_G_WARNING, "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n", + regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3)); + domain_crash(v->domain); + } +} + +/* + * Sets info for non boot SMP vcpu. VCPU 0 context is set by the library. + * In case of linux, the call comes from cpu_initialize_context(). 
+ */ +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp) +{ + if ( v->vcpu_id == 0 ) + return 0; + + vmx_vmcs_enter(v); + __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr); + __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit); + __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user); + + __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs); + __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds); + __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es); + __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss); + __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs); + + if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) ) + { + vmx_vmcs_exit(v); + return -EINVAL; + } + vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel); + + vmx_vmcs_exit(v); + return 0; +} diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index 6c74e96..0e8f1bd 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -744,7 +744,7 @@ int cpuid_hypervisor_leaves( uint32_t idx, uint32_t sub_idx, return 1; } -static void pv_cpuid(struct cpu_user_regs *regs) +void pv_cpuid(struct cpu_user_regs *regs) { uint32_t a, b, c, d; @@ -1903,7 +1903,7 @@ static int is_cpufreq_controller(struct domain *d) #include "x86_64/mmconfig.h" -static int emulate_privileged_op(struct cpu_user_regs *regs) +int emulate_privileged_op(struct cpu_user_regs *regs) { enum x86_segment which_sel; struct vcpu *v = current; diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h index ad341dc..3ed6cff 100644 --- a/xen/include/asm-x86/hvm/vmx/vmx.h +++ b/xen/include/asm-x86/hvm/vmx/vmx.h @@ -472,6 +472,8 @@ void setup_ept_dump(void); void vmx_update_guest_eip(void); void vmx_dr_access(unsigned long exit_qualification, struct cpu_user_regs *regs); +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs); +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp); int alloc_p2m_hap_data(struct p2m_domain *p2m); void free_p2m_hap_data(struct p2m_domain *p2m); diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h index 9acd9ea..14ce706 100644 --- a/xen/include/asm-x86/processor.h +++ b/xen/include/asm-x86/processor.h @@ -567,6 +567,7 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len); int microcode_resume_cpu(int cpu); int emulate_forced_invalid_op(struct cpu_user_regs *regs); +void pv_cpuid(struct cpu_user_regs *regs); #endif /* !__ASSEMBLY__ */ #endif /* __ASM_X86_PROCESSOR_H */ diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h index 82cbcee..308b129 100644 --- a/xen/include/asm-x86/traps.h +++ b/xen/include/asm-x86/traps.h @@ -49,4 +49,6 @@ extern int guest_has_trap_callback(struct domain *d, uint16_t vcpuid, extern int send_guest_trap(struct domain *d, uint16_t vcpuid, unsigned int trap_nr); +int emulate_privileged_op(struct cpu_user_regs *regs); + #endif /* ASM_TRAP_H */ -- 1.7.2.3
Jan Beulich
2013-Jun-25 08:40 UTC
Re: [PATCH 01/18] PVH xen: turn gdb_frames/gdt_ents into union
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Changes in V2:
>   - Add __XEN_INTERFACE_VERSION__
>
> Changes in V3:
>   - Rename union to 'gdt' and rename field names.
>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

To reduce the amount of editing when eventually applying your patches,
please keep the natural ordering of tags: review tags (likewise ack and
test ones) can't possibly come before your (the initial) sign-off, as no
one can have reviewed the patch before you posted it.

Jan
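Illustratively, the tag ordering being asked for (names are placeholders):

    Signed-off-by: Patch Author <author@example.com>
    Reviewed-by: Some Reviewer <reviewer@example.com>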
Jan Beulich
2013-Jun-25 08:44 UTC
Re: [PATCH 03/18] PVH xen: Move e820 fields out of pv_domain struct
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> This patch moves fields out of the pv_domain struct as they are used by
> PVH also.
>
> Changes in V6:
>   - Don't base the initialization and cleanup on the guest type.
>
> Changes in V7:
>   - If statement doesn't need to be split across lines anymore.
>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> ---
>  xen/arch/x86/domain.c        |   10 ++++------
>  xen/arch/x86/mm.c            |   26 ++++++++++++--------------
>  xen/include/asm-x86/domain.h |   10 +++++-----
>  3 files changed, 21 insertions(+), 25 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index d530964..6c85c94 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -553,6 +553,7 @@ int arch_domain_create(struct domain *d, unsigned int
> domcr_flags)
>          if ( (rc = iommu_domain_init(d)) != 0 )
>              goto fail;
>      }
> +    spin_lock_init(&d->arch.e820_lock);
>
>      if ( is_hvm_domain(d) )
>      {
> @@ -563,13 +564,9 @@ int arch_domain_create(struct domain *d, unsigned int
> domcr_flags)
>          }
>      }
>      else
> -    {
>          /* 64-bit PV guest by default. */
>          d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
>
> -        spin_lock_init(&d->arch.pv_domain.e820_lock);
> -    }
> -
>      /* initialize default tsc behavior in case tools don't */
>      tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
>      spin_lock_init(&d->arch.vtsc_lock);
> @@ -592,8 +589,9 @@ void arch_domain_destroy(struct domain *d)
>  {
>      if ( is_hvm_domain(d) )
>          hvm_domain_destroy(d);
> -    else
> -        xfree(d->arch.pv_domain.e820);
> +
> +    if ( d->arch.e820 )
> +        xfree(d->arch.e820);
>
>      free_domain_pirqs(d);
>      if ( !is_idle_domain(d) )
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 5123860..9f58968 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4759,11 +4759,11 @@ long arch_memory_op(int op,
> XEN_GUEST_HANDLE_PARAM(void) arg)
>              return -EFAULT;
>          }
>
> -        spin_lock(&d->arch.pv_domain.e820_lock);
> -        xfree(d->arch.pv_domain.e820);
> -        d->arch.pv_domain.e820 = e820;
> -        d->arch.pv_domain.nr_e820 = fmap.map.nr_entries;
> -        spin_unlock(&d->arch.pv_domain.e820_lock);
> +        spin_lock(&d->arch.e820_lock);
> +        xfree(d->arch.e820);
> +        d->arch.e820 = e820;
> +        d->arch.nr_e820 = fmap.map.nr_entries;
> +        spin_unlock(&d->arch.e820_lock);
>
>          rcu_unlock_domain(d);
>          return rc;
> @@ -4777,26 +4777,24 @@ long arch_memory_op(int op,
> XEN_GUEST_HANDLE_PARAM(void) arg)
>          if ( copy_from_guest(&map, arg, 1) )
>              return -EFAULT;
>
> -        spin_lock(&d->arch.pv_domain.e820_lock);
> +        spin_lock(&d->arch.e820_lock);
>
>          /* Backwards compatibility.
*/ > - if ( (d->arch.pv_domain.nr_e820 == 0) || > - (d->arch.pv_domain.e820 == NULL) ) > + if ( (d->arch.nr_e820 == 0) || (d->arch.e820 == NULL) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -ENOSYS; > } > > - map.nr_entries = min(map.nr_entries, d->arch.pv_domain.nr_e820); > - if ( copy_to_guest(map.buffer, d->arch.pv_domain.e820, > - map.nr_entries) || > + map.nr_entries = min(map.nr_entries, d->arch.nr_e820); > + if ( copy_to_guest(map.buffer, d->arch.e820, map.nr_entries) || > __copy_to_guest(arg, &map, 1) ) > { > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return -EFAULT; > } > > - spin_unlock(&d->arch.pv_domain.e820_lock); > + spin_unlock(&d->arch.e820_lock); > return 0; > } > > diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h > index d79464d..c3f9f8e 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -234,11 +234,6 @@ struct pv_domain > > /* map_domain_page() mapping cache. */ > struct mapcache_domain mapcache; > - > - /* Pseudophysical e820 map (XENMEM_memory_map). */ > - spinlock_t e820_lock; > - struct e820entry *e820; > - unsigned int nr_e820; > }; > > struct arch_domain > @@ -313,6 +308,11 @@ struct arch_domain > (possibly other cases in the future */ > uint64_t vtsc_kerncount; /* for hvm, counts all vtsc */ > uint64_t vtsc_usercount; /* not used for hvm */ > + > + /* Pseudophysical e820 map (XENMEM_memory_map). */ > + spinlock_t e820_lock; > + struct e820entry *e820; > + unsigned int nr_e820; > } __cacheline_aligned; > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) > -- > 1.7.2.3 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Jan Beulich
2013-Jun-25 08:48 UTC
Re: [PATCH 04/18] PVH xen: vmx related preparatory changes for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -475,6 +475,7 @@ void vmx_vmcs_switch(struct vmcs_struct *from, struct vmcs_struct *to);
>  void vmx_set_eoi_exit_bitmap(struct vcpu *v, u8 vector);
>  void vmx_clear_eoi_exit_bitmap(struct vcpu *v, u8 vector);
>  int vmx_check_msr_bitmap(unsigned long *msr_bitmap, u32 msr, int access_type);
> +void vmx_fpu_enter(struct vcpu *v);

I think this is misplaced here, but of course that's not a major issue.

Jan
Jan Beulich
2013-Jun-25 08:51 UTC
Re: [PATCH 05/18] PVH xen: hvm/vmcs related preparatory changes for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> In this patch, some common code is factored out to create
> vmx_set_common_host_vmcs_fields() to be used by PVH. Also, some changes
> in hvm.c as hvm_domain.params is not set for PVH.
> 
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
>  xen/arch/x86/hvm/hvm.c      |   10 ++++---
>  xen/arch/x86/hvm/vmx/vmcs.c |   58 +++++++++++++++++++++++-------------------

The changes to the two files don't appear to be connected to one another in any way - please make this an HVM patch and a VMX patch, thus also identifiable easily via the patch title. As patch 4 already was VMX-specific, perhaps the HVM one should come before the vendor specific ones?

Jan

> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 43b6d05..118e21a 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1070,10 +1070,13 @@ int hvm_vcpu_initialise(struct vcpu *v)
>  {
>      int rc;
>      struct domain *d = v->domain;
> -    domid_t dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> +    domid_t dm_domid;
> 
>      hvm_asid_flush_vcpu(v);
> 
> +    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
> +    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
> +
>      if ( (rc = vlapic_init(v)) != 0 )
>          goto fail1;
> 
> @@ -1084,6 +1087,8 @@ int hvm_vcpu_initialise(struct vcpu *v)
>           && (rc = nestedhvm_vcpu_initialise(v)) < 0 )
>          goto fail3;
> 
> +    dm_domid = d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> +
>      /* Create ioreq event channel. */
>      rc = alloc_unbound_xen_event_channel(v, dm_domid, NULL);
>      if ( rc < 0 )
> @@ -1106,9 +1111,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
>          get_ioreq(v)->vp_eport = v->arch.hvm_vcpu.xen_port;
>      spin_unlock(&d->arch.hvm_domain.ioreq.lock);
> 
> -    spin_lock_init(&v->arch.hvm_vcpu.tm_lock);
> -    INIT_LIST_HEAD(&v->arch.hvm_vcpu.tm_list);
> -
>      v->arch.hvm_vcpu.inject_trap.vector = -1;
> 
>      rc = setup_compat_arg_xlat(v);
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index ef0ee7f..43539a6 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -825,11 +825,40 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
>      virtual_vmcs_exit(vvmcs);
>  }
> 
> -static int construct_vmcs(struct vcpu *v)
> +static void vmx_set_common_host_vmcs_fields(struct vcpu *v)
>  {
> -    struct domain *d = v->domain;
>      uint16_t sysenter_cs;
>      unsigned long sysenter_eip;
> +
> +    /* Host data selectors. */
> +    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> +    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> +    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> +    __vmwrite(HOST_FS_SELECTOR, 0);
> +    __vmwrite(HOST_GS_SELECTOR, 0);
> +    __vmwrite(HOST_FS_BASE, 0);
> +    __vmwrite(HOST_GS_BASE, 0);
> +
> +    /* Host control registers. */
> +    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> +    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> +    __vmwrite(HOST_CR4,
> +              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
> +
> +    /* Host CS:RIP. */
> +    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
> +    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
> +
> +    /* Host SYSENTER CS:RIP. */
> +    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
> +    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
> +    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
> +    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
> +}
> +
> +static int construct_vmcs(struct vcpu *v)
> +{
> +    struct domain *d = v->domain;
>      u32 vmexit_ctl = vmx_vmexit_control;
>      u32 vmentry_ctl = vmx_vmentry_control;
> 
> @@ -932,30 +961,7 @@ static int construct_vmcs(struct vcpu *v)
>          __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
>      }
> 
> -    /* Host data selectors. */
> -    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> -    __vmwrite(HOST_FS_SELECTOR, 0);
> -    __vmwrite(HOST_GS_SELECTOR, 0);
> -    __vmwrite(HOST_FS_BASE, 0);
> -    __vmwrite(HOST_GS_BASE, 0);
> -
> -    /* Host control registers. */
> -    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> -    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> -    __vmwrite(HOST_CR4,
> -              mmu_cr4_features | (xsave_enabled(v) ? X86_CR4_OSXSAVE : 0));
> -
> -    /* Host CS:RIP. */
> -    __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
> -    __vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
> -
> -    /* Host SYSENTER CS:RIP. */
> -    rdmsrl(MSR_IA32_SYSENTER_CS, sysenter_cs);
> -    __vmwrite(HOST_SYSENTER_CS, sysenter_cs);
> -    rdmsrl(MSR_IA32_SYSENTER_EIP, sysenter_eip);
> -    __vmwrite(HOST_SYSENTER_EIP, sysenter_eip);
> +    vmx_set_common_host_vmcs_fields(v);
> 
>      /* MSR intercepts. */
>      __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
Jan Beulich
2013-Jun-25 09:01 UTC
Re: [PATCH 06/18] PVH xen: Introduce PVH guest type and some basic changes.
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -644,6 +644,13 @@ int arch_set_info_guest(
>      unsigned int i;
>      int rc = 0, compat;
> 
> +    /* This removed when all patches are checked in and PVH is done. */
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        printk("PVH: You don't have the correct xen version for PVH\n");
> +        return -EINVAL;
> +    }
> +

As the patch doesn't add code setting guest_type to is_pvh, it is pointless to add this here. The only logical thing, if at all, would be an ASSERT().

> --- a/xen/include/asm-x86/desc.h
> +++ b/xen/include/asm-x86/desc.h
> @@ -38,7 +38,13 @@
> 
>  #ifndef __ASSEMBLY__
> 
> +#ifndef NDEBUG
> +/* PVH 32bitfixme : see emulate_gate_op call from do_general_protection */
> +#define GUEST_KERNEL_RPL(d) (is_pvh_domain(d) ? ({ BUG(); 0; }) : \
> +                                                is_pv_32bit_domain(d) ? 1 : 3)
> +#else
>  #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
> +#endif

As it is easily doable, please do it without an explicit check of NDEBUG. E.g.

#define GUEST_KERNEL_RPL(d) ({ ASSERT(!is_pvh_domain(d)); \
                               is_pv_32bit_domain(d) ? 1 : 3; })

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -238,6 +238,14 @@ struct mem_event_per_domain
>      struct mem_event_domain access;
>  };
> 
> +/*
> + * PVH is a PV guest running in an HVM container. While is_hvm_* checks are
> + * false for it, it uses many of the HVM data structs.
> + */
> +enum guest_type {
> +    is_pv, is_pvh, is_hvm

Pretty odd names for enumerators - it's more conventional for them to have a prefix identifying their enumeration type in some way.

> @@ -732,8 +743,12 @@ void watchdog_domain_destroy(struct domain *d);
> 
>  #define VM_ASSIST(_d,_t) (test_bit((_t), &(_d)->vm_assist))
> 
> -#define is_hvm_domain(d) ((d)->is_hvm)
> +#define is_pv_domain(d) ((d)->guest_type == is_pv)
> +#define is_pv_vcpu(v)   (is_pv_domain(v->domain))

Even if the pre-existing is_hvm_vcpu() gives a bad example - please properly parenthesize macro parameters.

> +#define is_hvm_domain(d) ((d)->guest_type == is_hvm)
>  #define is_hvm_vcpu(v)   (is_hvm_domain(v->domain))
> +#define is_pvh_domain(d) ((d)->guest_type == is_pvh)
> +#define is_pvh_vcpu(v)   (is_pvh_domain(v->domain))

Same here.

Jan

>  #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
>                             cpumask_weight((v)->cpu_affinity) == 1)
>  #ifdef HAS_PASSTHROUGH
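A minimal sketch of the parenthesization being asked for in the last two comments; the macros are the ones from the quoted hunks, with only the parameter uses wrapped:

    #define is_pv_vcpu(v)    (is_pv_domain((v)->domain))
    #define is_hvm_vcpu(v)   (is_hvm_domain((v)->domain))
    #define is_pvh_vcpu(v)   (is_pvh_domain((v)->domain))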
Jan Beulich
2013-Jun-25 09:13 UTC
Re: [PATCH 07/18] PVH xen: domain create, scheduler related code changes
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> @@ -750,7 +753,10 @@ int arch_set_info_guest(
>          v->arch.pv_vcpu.iopl = (v->arch.user_regs.eflags >> 12) & 3;
>          v->arch.user_regs.eflags &= ~X86_EFLAGS_IOPL;
> 
> -        /* Ensure real hardware interrupts are enabled. */
> +        /*
> +         * Ensure real hardware interrupts are enabled. Note: PVH may not have
> +         * IDT set on all vcpus so we don't enable IF for it yet.
> +         */
>          v->arch.user_regs.eflags |= X86_EFLAGS_IF;
> 
>      if ( !v->is_initialised )

Please drop this comment change - it's confusing, as it contradicts what the code is doing, and (just like for HVM) the code is being bypassed for PVH.

> @@ -852,6 +858,7 @@ int arch_set_info_guest(
> 
>      set_bit(_VPF_in_reset, &v->pause_flags);
> 
> +pvh_skip_pv_stuff:

Labels should be indented by at least one space.

> @@ -942,6 +956,13 @@ int arch_set_info_guest(
> 
>      update_cr3(v);
> 
> +    if ( is_pvh_vcpu(v) )
> +    {
> +        /* Guest is bringing up non-boot SMP vcpu. */
> +        if ( (rc=hvm_set_vcpu_info(v, c.nat)) != 0 )

Coding style. Also, if this is PVH specific, the function name should start with pvh_.

Jan
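A minimal sketch of what the two style comments amount to; the pvh_set_vcpu_info() name follows the suggestion above, and the goto target is a hypothetical error path:

     pvh_skip_pv_stuff:                   /* label indented by one space */
        if ( is_pvh_vcpu(v) )
        {
            /* Guest is bringing up non-boot SMP vcpu. */
            if ( (rc = pvh_set_vcpu_info(v, c.nat)) != 0 )   /* spaces around '=' */
                goto out;                /* hypothetical error path */
        }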
Jan Beulich
2013-Jun-25 09:16 UTC
Re: [PATCH 08/18] PVH xen: support invalid op emulation for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -459,6 +459,10 @@ static void instruction_done(
>      struct cpu_user_regs *regs, unsigned long eip, unsigned int bpmatch)
>  {
>      regs->eip = eip;
> +
>     if ( is_pvh_vcpu(current) )
> +        return;

This is lacking a tag identifying it as needing to be fixed.

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -566,6 +566,7 @@ void microcode_set_module(unsigned int);
>  int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
>  int microcode_resume_cpu(int cpu);
> 
> +int emulate_forced_invalid_op(struct cpu_user_regs *regs);

This would more logically belong in traps.h, I think.

Jan
Jan Beulich
2013-Jun-25 09:36 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> @@ -1524,6 +1528,49 @@ static int read_descriptor(unsigned int sel,
>      return 1;
>  }
> 
> +static int read_descriptor_sel(unsigned int sel,
> +                               enum x86_segment which_sel,
> +                               struct vcpu *v,
> +                               const struct cpu_user_regs *regs,
> +                               unsigned long *base,
> +                               unsigned long *limit,
> +                               unsigned int *ar,
> +                               unsigned int vm86attr)
> +{
> +    struct segment_register seg;
> +    unsigned int long_mode = 0;

Pointless initializer and bogus type.

> +
> +    if ( !is_pvh_vcpu(v) )
> +        return read_descriptor(sel, v, regs, base, limit, ar, vm86attr);
> +
> +    hvm_get_segment_register(v, x86_seg_cs, &seg);
> +    long_mode = seg.attr.fields.l;
> +
> +    if ( which_sel != x86_seg_cs )
> +        hvm_get_segment_register(v, which_sel, &seg);
> +
> +    /* "ar" is returned packed as in segment_attributes_t. Fix it up. */
> +    *ar = (unsigned int)seg.attr.bytes;

Is the cast really needed for anything here?

> +    *ar = (*ar & 0xff ) | ((*ar & 0xf00) << 4);
> +    *ar = *ar << 8;

Preferably fold this into the prior expression or use <<=.

> +
> +    if ( long_mode )
> +    {
> +        *limit = ~0UL;
> +
> +        if ( which_sel < x86_seg_fs )
> +        {
> +            *base = 0UL;
> +            return 1;
> +        }
> +    }
> +    else
> +        *limit = (unsigned long)seg.limit;

Again - is the cast really needed for anything here?

> --- a/xen/include/asm-x86/system.h
> +++ b/xen/include/asm-x86/system.h
> @@ -4,10 +4,20 @@
>  #include <xen/lib.h>
>  #include <xen/bitops.h>
> 
> -#define read_segment_register(vcpu, regs, name)                 \
> -({  u16 __sel;                                                  \
> -    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
> -    __sel;                                                      \
> +/*
> + * We need vcpu because during context switch, going from PVH to PV,
> + * in save_segments(), current has been updated to next, and no longer pointing
> + * to the PVH.
> + */

This is bogus - you shouldn't need any of the {save,load}_segment() machinery for PVH, and hence this is not a valid reason for adding a vcpu parameter here.

> +#define read_segment_register(vcpu, regs, name)             \
> +({  u16 __sel;                                               \
> +    struct cpu_user_regs *_regs = (regs);                    \

_If_ these changes need to remain, please const-qualify this pointer to clarify the intention of not modifying anything.

> +                                                             \
> +    if ( is_pvh_vcpu(vcpu) && guest_mode(regs) )             \

Need to use _regs here too.

> +        __sel = _regs->name;                                 \

By now (going through the series sequentially) I don't think I've seen code writing these fields, so reading them can't logically be done at this point. Or did I overlook anything? If reordering this would cause a lot of grief, I'd be tolerant of this provided a note gets added to the commit message.

Jan
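A minimal sketch of the folded attribute fix-up (cast dropped, final shift merged into one expression); seg is the local from the quoted function:

    /* "ar" is returned packed as in segment_attributes_t. Fix it up. */
    *ar = seg.attr.bytes;
    *ar = ((*ar & 0xff) | ((*ar & 0xf00) << 4)) << 8;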
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> PVH doesn't use map cache. show_registers() for PVH takes the HVM path.
> 
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> index 9297ea0..5092fdb 100644
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -34,7 +34,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>       * then it means we are running on the idle domain's page table and must
>       * therefore use its mapcache.
>       */
> -    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
> +    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
>      {
>          /* If we really are idling, perform lazy context switch now. */
>          if ( (v = idle_vcpu[smp_processor_id()]) == current )
> @@ -71,7 +71,7 @@ void *map_domain_page(unsigned long mfn)
>  #endif
> 
>      v = mapcache_current_vcpu();
> -    if ( !v || is_hvm_vcpu(v) )
> +    if ( !v || !is_pv_vcpu(v) )
>          return mfn_to_virt(mfn);
> 
>      dcache = &v->domain->arch.pv_domain.mapcache;
> @@ -176,7 +176,7 @@ void unmap_domain_page(const void *ptr)
>      ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
> 
>      v = mapcache_current_vcpu();
> -    ASSERT(v && !is_hvm_vcpu(v));
> +    ASSERT(v && is_pv_vcpu(v));
> 
>      dcache = &v->domain->arch.pv_domain.mapcache;
>      ASSERT(dcache->inuse);
> @@ -243,7 +243,7 @@ int mapcache_domain_init(struct domain *d)
>      struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
>      unsigned int bitmap_pages;
> 
> -    if ( is_hvm_domain(d) || is_idle_domain(d) )
> +    if ( !is_pv_domain(d) || is_idle_domain(d) )
>          return 0;
> 
>  #ifdef NDEBUG
> @@ -274,7 +274,7 @@ int mapcache_vcpu_init(struct vcpu *v)
>      unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
>      unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
> 
> -    if ( is_hvm_vcpu(v) || !dcache->inuse )
> +    if ( !is_pv_vcpu(v) || !dcache->inuse )
>          return 0;
> 
>      if ( ents > dcache->entries )
> diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
> index d2f7209..bcfd740 100644
> --- a/xen/arch/x86/x86_64/traps.c
> +++ b/xen/arch/x86/x86_64/traps.c
> @@ -85,7 +85,7 @@ void show_registers(struct cpu_user_regs *regs)
>      enum context context;
>      struct vcpu *v = current;
> 
> -    if ( is_hvm_vcpu(v) && guest_mode(regs) )
> +    if ( !is_pv_vcpu(v) && guest_mode(regs) )
>      {
>          struct segment_register sreg;
>          context = CTXT_hvm_guest;
> @@ -146,8 +146,8 @@ void vcpu_show_registers(const struct vcpu *v)
>      const struct cpu_user_regs *regs = &v->arch.user_regs;
>      unsigned long crs[8];
> 
> -    /* No need to handle HVM for now. */
> -    if ( is_hvm_vcpu(v) )
> +    /* No need to handle HVM and PVH for now. */
> +    if ( !is_pv_vcpu(v) )
>          return;
> 
>      crs[0] = v->arch.pv_vcpu.ctrlreg[0];
> -- 
> 1.7.2.3
Jan Beulich
2013-Jun-25 09:54 UTC
Re: [PATCH 14/18] PVH xen: Checks, asserts, and limitations for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- a/xen/arch/x86/hvm/mtrr.c
> +++ b/xen/arch/x86/hvm/mtrr.c
> @@ -578,6 +578,9 @@ int32_t hvm_set_mem_pinned_cacheattr(
>  {
>      struct hvm_mem_pinned_cacheattr_range *range;
> 
> +    /* A PVH guest writes to MSR_IA32_CR_PAT natively. */
> +    ASSERT(!is_pvh_domain(d));

This can't be an assert - or did I overlook you preventing the function from being called for PVH guests?

The comment would then be wrong too, as there is a path leading here from a domctl (i.e. unaffected by how the guest itself would access the MSR).

> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -475,6 +475,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> 
>      case PHYSDEVOP_set_iopl: {
>          struct physdev_set_iopl set_iopl;
> +
> +        if ( is_pvh_vcpu(current) )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
> +
>          ret = -EFAULT;
>          if ( copy_from_guest(&set_iopl, arg, 1) != 0 )
>              break;
> @@ -488,6 +495,12 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> 
>      case PHYSDEVOP_set_iobitmap: {
>          struct physdev_set_iobitmap set_iobitmap;
> +
> +        if ( is_pvh_vcpu(current) )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
>          ret = -EFAULT;
>          if ( copy_from_guest(&set_iobitmap, arg, 1) != 0 )
>              break;

I would really like these two to have better distinguishable error codes (e.g. -EPERM).

> @@ -3325,6 +3327,9 @@ void do_device_not_available(struct cpu_user_regs *regs)
> 
>      BUG_ON(!guest_mode(regs));
> 
> +    /* PVH should not get here. (ctrlreg is not implemented). */
> +    ASSERT(!is_pvh_vcpu(curr));

I think this is right, but the comment is confusing/misleading.

> --- a/xen/arch/x86/x86_64/traps.c
> +++ b/xen/arch/x86/x86_64/traps.c
> @@ -440,6 +440,8 @@ static long register_guest_callback(struct callback_register *reg)
>      long ret = 0;
>      struct vcpu *v = current;
> 
> +    ASSERT(!is_pvh_vcpu(v));
> +

For one, I don't think there has been anything so far making clear that this is unreachable for PVH. And then it is inconsistent to do this here, but not also in unregister_guest_callback().

Jan
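A minimal sketch of the error-code change being suggested for the two physdev cases; only the error value differs from the quoted hunk:

    case PHYSDEVOP_set_iopl: {
        struct physdev_set_iopl set_iopl;

        if ( is_pvh_vcpu(current) )
        {
            ret = -EPERM;   /* distinguishable from the -EFAULT/-EINVAL below */
            break;
        }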
Jan Beulich
2013-Jun-25 10:12 UTC
Re: [PATCH 15/18] PVH xen: add hypercall support for PVH
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>  int hvm_do_hypercall(struct cpu_user_regs *regs)
>  {
>      struct vcpu *curr = current;
>      struct segment_register sreg;
>      int mode = hvm_guest_x86_mode(curr);
>      uint32_t eax = regs->eax;
> +    hvm_hypercall_t **hcall_table;

If you properly const-qualified this, ...

> @@ -3348,17 +3379,24 @@ int hvm_do_hypercall(struct cpu_user_regs *regs)
>                      eax, regs->rdi, regs->rsi, regs->rdx,
>                      regs->r10, regs->r8, regs->r9);
> 
> +    if ( is_pvh_vcpu(curr) )
> +        hcall_table = (hvm_hypercall_t **)pvh_hypercall64_table;
> +    else
> +        hcall_table = (hvm_hypercall_t **)hvm_hypercall64_table;

... you wouldn't need these dangerous casts.

> @@ -3777,7 +3815,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
>          return -ESRCH;
> 
>      rc = -EINVAL;
> -    if ( !is_hvm_domain(d) )
> +    if ( is_pv_domain(d) )
>          goto param_fail;
> 
>      rc = xsm_hvm_param(XSM_TARGET, d, op);
> @@ -3949,7 +3987,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
>          break;
>      }
> 
> -    if ( rc == 0 )
> +    if ( rc == 0 && !is_pvh_domain(d) )
>      {
>          d->arch.hvm_domain.params[a.index] = a.value;
> 

This last check I think you do because params[] points nowhere for PVH guests. If so - why don't you just drop this and the earlier hunk? Or otherwise some of the case statements between need to also guard against accessing the unset pointer.

Jan
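A minimal sketch of the const-qualification being suggested; it assumes both tables are defined as arrays of const function pointers (which the casts in the quoted hunk imply):

    /* Array-of-const-pointer element type makes the casts unnecessary. */
    hvm_hypercall_t *const *hcall_table;

    hcall_table = is_pvh_vcpu(curr) ? pvh_hypercall64_table
                                    : hvm_hypercall64_table;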
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> @@ -1281,8 +1521,11 @@ void vmx_do_resume(struct vcpu *v)
> 
>          vmx_clear_vmcs(v);
>          vmx_load_vmcs(v);
> -        hvm_migrate_timers(v);
> -        hvm_migrate_pirqs(v);
> +        if ( !is_pvh_vcpu(v) )
> +        {
> +            hvm_migrate_timers(v);
> +            hvm_migrate_pirqs(v);
> +        }

This change is not covered by the patch description, and it is (at least to me) all but obvious why it is being done.

Jan
George Dunlap
2013-Jun-25 10:17 UTC
Re: [PATCH 00/18][V7]: PVH xen: Phase I, Version 7 patches...
On Tue, Jun 25, 2013 at 1:01 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Hi all,
>
> This is V7 of PVH patches for xen. These are xen changes to support
> boot of a 64bit PVH domU guest. Built on top of unstable git c/s:
> a12d15d8c1d512a4ed6498b39f9058f69a1c1f6c

You don't happen to have a public git tree somewhere around, do you?

 -George
>>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> --- /dev/null
> +++ b/xen/arch/x86/hvm/vmx/vmx_pvh.c

Just pvh.c please - the directory we're in already tells us that this is VMX.

> @@ -0,0 +1,523 @@
> +/*
> + * Copyright (C) 2013, Mukesh Rathor, Oracle Corp. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +
> +#include <xen/hypercall.h>
> +#include <xen/guest_access.h>
> +#include <asm/p2m.h>
> +#include <asm/traps.h>
> +#include <asm/hvm/vmx/vmx.h>
> +#include <public/sched.h>
> +#include <asm/hvm/nestedhvm.h>
> +#include <asm/xstate.h>
> +
> +#ifndef NDEBUG
> +int pvhdbg = 0;

This apparently not being declared in a header suggests that it should be static.

> +#define dbgp1(...) do { (pvhdbg == 1) ? printk(__VA_ARGS__) : 0; } while ( 0 )
> +#else
> +#define dbgp1(...) ((void)0)
> +#endif
> +
> +
> +/* NOTE: this does NOT read the CS. */

Should probably also say why.

> +static void read_vmcs_selectors(struct cpu_user_regs *regs)
> +{
> +    regs->ss = __vmread(GUEST_SS_SELECTOR);
> +    regs->ds = __vmread(GUEST_DS_SELECTOR);
> +    regs->es = __vmread(GUEST_ES_SELECTOR);
> +    regs->gs = __vmread(GUEST_GS_SELECTOR);
> +    regs->fs = __vmread(GUEST_FS_SELECTOR);
> +}

By only conditionally reading the selector registers, how do you guarantee that read_segment_register() would always read valid values? I think that macro needs to not look at "regs->?s" at all...

> +static int vmxit_debug(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *vp = current;
> +    unsigned long exit_qualification = __vmread(EXIT_QUALIFICATION);
> +
> +    write_debugreg(6, exit_qualification | 0xffff0ff0);
> +
> +    /* gdbsx or another debugger. */
> +    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
> +         guest_kernel_mode(vp, regs) &&  vp->domain->debugger_attached )
> +

Bogus double blank, and bogus blank line.

> +        domain_pause_for_debugger();
> +    else
> +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
> +
> +    return 0;
> +}

[...]

> +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> +{
> +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )

Did you perhaps mean !guest_kernel_mode()?

> +    default:
> +        gdprintk(XENLOG_G_WARNING,
> +                 "PVH: Unhandled trap:%d. IP:%lx\n", vector, regs->eip);

gdprintk() shouldn't be used with XENLOG_G_*.

> +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs)
> +{
> +    unsigned long exit_qualification;
> +    unsigned int exit_reason = __vmread(VM_EXIT_REASON);
> +    int rc=0, ccpu = smp_processor_id();
> +    struct vcpu *v = current;
> +
> +    dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx EFLAGS:%lx CR0:%lx\n",
> +          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
> +          __vmread(GUEST_CR0));
> +
> +    /* For guest_kernel_mode which is called from most places below. */
> +    regs->cs = __vmread(GUEST_CS_SELECTOR);

Which raises the question of whether your uses of guest_kernel_mode() are appropriate in the first place: Before this series there's no use at all under xen/arch/x86/hvm/. And if it is, I'd like to point out once again that this check should be looking at SS.DPL, not CS.RPL.

> +    if ( rc )
> +    {
> +        exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        gdprintk(XENLOG_G_WARNING,
> +                 "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n",

Please use %#x and alike in favor of 0x%x etc.

> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);

How does this work without also writing the "hidden" register fields?

> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +    {
> +        vmx_vmcs_exit(v);
> +        return -EINVAL;
> +    }
> +    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel);

So you write both GS bases, but not the base of FS (and above its selector is being skipped too)? And there are other parts of struct vcpu_guest_context that I don't see getting mirrored - are all of them getting handled elsewhere?

Jan
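A minimal sketch of the scoping fix for pvhdbg; writing dbgp1() as a plain if (rather than the quoted conditional-expression form) is an assumption, not something asked for in the review:

    #ifndef NDEBUG
    static int pvhdbg;
    #define dbgp1(...) do { if ( pvhdbg == 1 ) printk(__VA_ARGS__); } while ( 0 )
    #else
    #define dbgp1(...) ((void)0)
    #endif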
Konrad Rzeszutek Wilk
2013-Jun-25 14:29 UTC
Re: [PATCH 10/18] PVH xen: interrupt/event-channel delivery to PVH
On Mon, Jun 24, 2013 at 05:01:39PM -0700, Mukesh Rathor wrote:
> PVH uses HVMIRQ_callback_vector for interrupt delivery. Also, change
> hvm_vcpu_has_pending_irq() as PVH doesn't use vlapic emulation.

Please explain why it can't use the normal "if .." in hvm_vcpu_has_pending_irq(). I figured it is b/c the guest boots in an HVM container, so the event mechanism is offline until it gets enabled. And that means no HVM type interrupts (so emulated timer interrupts, say) should interrupt it until the event mechanism (or rather the callback vector) is in place.

But that is conjecture on my part, and I would appreciate you putting that in the git commit. The reasoning is that if somebody decides one day to take the knife to hvm_vcpu_has_pending_irq() they will know what to expect and what to test for.

Thank you.

> 
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
>  xen/arch/x86/hvm/irq.c       |    3 +++
>  xen/arch/x86/hvm/vmx/intr.c  |    8 ++++++--
>  xen/include/asm-x86/domain.h |    2 +-
>  xen/include/asm-x86/event.h  |    2 +-
>  4 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
> index 9eae5de..92fb245 100644
> --- a/xen/arch/x86/hvm/irq.c
> +++ b/xen/arch/x86/hvm/irq.c
> @@ -405,6 +405,9 @@ struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
>           && vcpu_info(v, evtchn_upcall_pending) )
>          return hvm_intack_vector(plat->irq.callback_via.vector);
> 
> +    if ( is_pvh_vcpu(v) )
> +        return hvm_intack_none;
> +
>      if ( vlapic_accept_pic_intr(v) && plat->vpic[0].int_output )
>          return hvm_intack_pic(0);
> 
> diff --git a/xen/arch/x86/hvm/vmx/intr.c b/xen/arch/x86/hvm/vmx/intr.c
> index e376f3c..ce42950 100644
> --- a/xen/arch/x86/hvm/vmx/intr.c
> +++ b/xen/arch/x86/hvm/vmx/intr.c
> @@ -165,6 +165,9 @@ static int nvmx_intr_intercept(struct vcpu *v, struct hvm_intack intack)
>  {
>      u32 ctrl;
> 
> +    if ( is_pvh_vcpu(v) )
> +        return 0;
> +
>      if ( nvmx_intr_blocked(v) != hvm_intblk_none )
>      {
>          enable_intr_window(v, intack);
> @@ -219,8 +222,9 @@ void vmx_intr_assist(void)
>          return;
>      }
> 
> -    /* Crank the handle on interrupt state. */
> -    pt_vector = pt_update_irq(v);
> +    if ( !is_pvh_vcpu(v) )
> +        /* Crank the handle on interrupt state. */
> +        pt_vector = pt_update_irq(v);
> 
>      do {
>          intack = hvm_vcpu_has_pending_irq(v);
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index c3f9f8e..b95314a 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -16,7 +16,7 @@
>  #define is_pv_32on64_domain(d) (is_pv_32bit_domain(d))
>  #define is_pv_32on64_vcpu(v)   (is_pv_32on64_domain((v)->domain))
> 
> -#define is_hvm_pv_evtchn_domain(d) (is_hvm_domain(d) && \
> +#define is_hvm_pv_evtchn_domain(d) (!is_pv_domain(d) && \
>          d->arch.hvm_domain.irq.callback_via_type == HVMIRQ_callback_vector)
>  #define is_hvm_pv_evtchn_vcpu(v) (is_hvm_pv_evtchn_domain(v->domain))
> 
> diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
> index 06057c7..7ed5812 100644
> --- a/xen/include/asm-x86/event.h
> +++ b/xen/include/asm-x86/event.h
> @@ -18,7 +18,7 @@ int hvm_local_events_need_delivery(struct vcpu *v);
>  static inline int local_events_need_delivery(void)
>  {
>      struct vcpu *v = current;
> -    return (is_hvm_vcpu(v) ? hvm_local_events_need_delivery(v) :
> +    return (!is_pv_vcpu(v) ? hvm_local_events_need_delivery(v) :
>              (vcpu_info(v, evtchn_upcall_pending) &&
>               !vcpu_info(v, evtchn_upcall_mask)));
>  }
> -- 
> 1.7.2.3
Konrad Rzeszutek Wilk
2013-Jun-25 14:30 UTC
Re: [PATCH 13/18] PVH xen: mtrr, tsc, grant changes...
On Mon, Jun 24, 2013 at 05:01:42PM -0700, Mukesh Rathor wrote:
> PVH only supports limited memory types in Phase I. TSC is limited to native
> mode only also for the moment. Finally, grant mapping of iomem for PVH hasn't
> been explorted in phase I.

explored.

> 
> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
> ---
>  xen/arch/x86/hvm/mtrr.c  |    8 ++++++++
>  xen/arch/x86/time.c      |    8 ++++++++
>  xen/common/grant_table.c |    4 ++--
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
> index ef51a8d..b9d6411 100644
> --- a/xen/arch/x86/hvm/mtrr.c
> +++ b/xen/arch/x86/hvm/mtrr.c
> @@ -693,6 +693,14 @@ uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
>           ((d->vcpu == NULL) || ((v = d->vcpu[0]) == NULL)) )
>          return MTRR_TYPE_WRBACK;
> 
> +    /* PVH fixme: Add support for more memory types. */
> +    if ( is_pvh_domain(d) )
> +    {
> +        if ( direct_mmio )
> +            return MTRR_TYPE_UNCACHABLE;
> +        return MTRR_TYPE_WRBACK;
> +    }
> +
>      if ( !v->domain->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] )
>          return MTRR_TYPE_WRBACK;
> 
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index 86640f5..5b1b6bb 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1893,6 +1893,14 @@ void tsc_set_info(struct domain *d,
>          d->arch.vtsc = 0;
>          return;
>      }
> +    if ( is_pvh_domain(d) && tsc_mode != TSC_MODE_NEVER_EMULATE )
> +    {
> +        /* PVH fixme: support more tsc modes. */
> +        printk(XENLOG_WARNING
> +               "PVH currently does not support tsc emulation. Setting timer_mode = native\n");
> +        d->arch.vtsc = 0;
> +        return;
> +    }
> 
>      switch ( d->arch.tsc_mode = tsc_mode )
>      {
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index 3f97328..a2073d2 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -721,7 +721,7 @@ __gnttab_map_grant_ref(
> 
>      double_gt_lock(lgt, rgt);
> 
> -    if ( !is_hvm_domain(ld) && need_iommu(ld) )
> +    if ( is_pv_domain(ld) && need_iommu(ld) )
>      {
>          unsigned int wrc, rdc;
>          int err = 0;
> @@ -932,7 +932,7 @@ __gnttab_unmap_common(
>          act->pin -= GNTPIN_hstw_inc;
>      }
> 
> -    if ( !is_hvm_domain(ld) && need_iommu(ld) )
> +    if ( is_pv_domain(ld) && need_iommu(ld) )
>      {
>          unsigned int wrc, rdc;
>          int err = 0;
> -- 
> 1.7.2.3
Mukesh Rathor
2013-Jun-26 00:04 UTC
Re: [PATCH 00/18][V7]: PVH xen: Phase I, Version 7 patches...
On Tue, 25 Jun 2013 11:17:21 +0100
George Dunlap <George.Dunlap@eu.citrix.com> wrote:

> On Tue, Jun 25, 2013 at 1:01 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
> > Hi all,
> >
> > This is V7 of PVH patches for xen. These are xen changes to support
> > boot of a 64bit PVH domU guest. Built on top of unstable git c/s:
> > a12d15d8c1d512a4ed6498b39f9058f69a1c1f6c
> 
> You don't happen to have a public git tree somewhere around, do you?

Working on it, hopefully very soon.

-Mukesh
Mukesh Rathor
2013-Jun-26 01:14 UTC
Re: [PATCH 06/18] PVH xen: Introduce PVH guest type and some basic changes.
On Tue, 25 Jun 2013 10:01:23 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -644,6 +644,13 @@ int arch_set_info_guest(
> >      unsigned int i;
> >      int rc = 0, compat;
> > 
> > +    /* This removed when all patches are checked in and PVH is done. */
> > +    if ( is_pvh_vcpu(v) )
> > +    {
> > +        printk("PVH: You don't have the correct xen version for PVH\n");
> > +        return -EINVAL;
> > +    }
> > +
> 
> As the patch doesn't add code setting guest_type to is_pvh, it is
> pointless to add this here. The only logical thing, if at all, would
> be an ASSERT().

Actually, now that dom0 and tools are out, one can't create a PVH guest anyway. My intention was to disallow creation till we reach a satisfactory point in the patches. So I can just move this to the tools/dom0 patchset, whichever is next.

> > --- a/xen/include/asm-x86/desc.h
> > +++ b/xen/include/asm-x86/desc.h
> > @@ -38,7 +38,13 @@
> > 
> >  #ifndef __ASSEMBLY__
> > 
> > +#ifndef NDEBUG
> > +/* PVH 32bitfixme : see emulate_gate_op call from do_general_protection */
> > +#define GUEST_KERNEL_RPL(d) (is_pvh_domain(d) ? ({ BUG(); 0; }) : \
> > +                                                is_pv_32bit_domain(d) ? 1 : 3)
> > +#else
> >  #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
> > +#endif
> 
> As it is easily doable, please do it without an explicit check of NDEBUG. E.g.
> 
> #define GUEST_KERNEL_RPL(d) ({ ASSERT(!is_pvh_domain(d)); \
>                                is_pv_32bit_domain(d) ? 1 : 3; })

OK, thanks.

> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -238,6 +238,14 @@ struct mem_event_per_domain
> >      struct mem_event_domain access;
> >  };
> > 
> > +/*
> > + * PVH is a PV guest running in an HVM container. While is_hvm_* checks are
> > + * false for it, it uses many of the HVM data structs.
> > + */
> > +enum guest_type {
> > +    is_pv, is_pvh, is_hvm
> 
> Pretty odd names for enumerators - it's more conventional for them
> to have a prefix identifying their enumeration type in some way.

Ok, which is better:

    guest_is_pv, guest_is_pvh, guest_is_hvm
or
    guest_type_is_pv, guest_type_is_pvh, guest_type_is_hvm
or
    dom_is_pv, dom_is_pvh, dom_is_hvm (change enum to domain_type)

thanks,
Mukesh
Jan Beulich
2013-Jun-26 08:18 UTC
Re: [PATCH 06/18] PVH xen: Introduce PVH guest type and some basic changes.
>>> On 26.06.13 at 03:14, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 10:01:23 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > --- a/xen/include/xen/sched.h
>> > +++ b/xen/include/xen/sched.h
>> > @@ -238,6 +238,14 @@ struct mem_event_per_domain
>> >      struct mem_event_domain access;
>> >  };
>> > 
>> > +/*
>> > + * PVH is a PV guest running in an HVM container. While is_hvm_*
>> > checks are
>> > + * false for it, it uses many of the HVM data structs.
>> > + */
>> > +enum guest_type {
>> > +    is_pv, is_pvh, is_hvm
>> 
>> Pretty odd names for enumerators - it's more conventional for them
>> to have a prefix identifying their enumeration type in some way.
> 
> Ok, which is better:
> 
>     guest_is_pv, guest_is_pvh, guest_is_hvm
> or
>     guest_type_is_pv, guest_type_is_pvh, guest_type_is_hvm
> or
>     dom_is_pv, dom_is_pvh, dom_is_hvm (change enum to domain_type)

    guest_type_pv, guest_type_pvh, guest_type_hvm
or
    dom_pv, dom_pvh, dom_hvm (change enum to domain_type)

are both fine with me - I mainly dislike the "is" in the names, as to me that's only a sensible prefix or infix for a function or macro testing a property.

Jan
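Sketched out, the first of those suggestions would read:

    enum guest_type {
        guest_type_pv, guest_type_pvh, guest_type_hvm
    };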
Mukesh Rathor
2013-Jun-26 22:41 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
On Tue, 25 Jun 2013 10:36:41 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > @@ -1524,6 +1528,49 @@ static int read_descriptor(unsigned int sel,
> > --- a/xen/include/asm-x86/system.h
> > +++ b/xen/include/asm-x86/system.h
> > @@ -4,10 +4,20 @@
> >  #include <xen/lib.h>
> >  #include <xen/bitops.h>
> > 
> > -#define read_segment_register(vcpu, regs, name)                 \
> > -({  u16 __sel;                                                  \
> > -    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
> > -    __sel;                                                      \
> > +/*
> > + * We need vcpu because during context switch, going from PVH to PV,
> > + * in save_segments(), current has been updated to next, and no longer pointing
> > + * to the PVH.
> > + */
> 
> This is bogus - you shouldn't need any of the {save,load}_segment()
> machinery for PVH, and hence this is not a valid reason for adding a
> vcpu parameter here.

Ok, let's revisit this again since it's been a few months already:

read_segment_register() is called from a few places for PVH, and for PVH it needs to read the value from regs. So it needs to be modified to check for PVH. Originally, I had started with checking for is_pvh_vcpu(current), but that failed quickly because of the context switch call chain:

    __context_switch -> ctxt_switch_from -> save_segments -> read_segment_register

In this path, going from PV to PVH, the intention is to save segments for PV, and since current has already been updated to point to the PVH vcpu, the check on current is not correct. Hence the need for the vcpu parameter. I will enhance the comments in the macro prolog in the next patch version.

Hope that resolves it.

thanks,
Mukesh
Mukesh Rathor
2013-Jun-27 02:43 UTC
Re: [PATCH 14/18] PVH xen: Checks, asserts, and limitations for PVH
On Tue, 25 Jun 2013 10:54:15 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > --- a/xen/arch/x86/hvm/mtrr.c
> > +++ b/xen/arch/x86/hvm/mtrr.c
> > @@ -578,6 +578,9 @@ int32_t hvm_set_mem_pinned_cacheattr(
> >  {
> >      struct hvm_mem_pinned_cacheattr_range *range;
> > 
> > +    /* A PVH guest writes to MSR_IA32_CR_PAT natively. */
> > +    ASSERT(!is_pvh_domain(d));
> 
> This can't be an assert, or did I overlook you preventing the
> function to be called for PVH guests.
> 
> The comment would then be wrong too, as there is a path
> leading here from a domctl (i.e. unaffected by how the guest
> itself would access the MSR).

Well, there are no callers right now, and I wanted to catch any during my test runs. But now I think the ASSERT should be replaced with returning -ENOSYS. Let me know if you disagree.

> > --- a/xen/arch/x86/x86_64/traps.c
> > +++ b/xen/arch/x86/x86_64/traps.c
> > @@ -440,6 +440,8 @@ static long register_guest_callback(struct callback_register *reg)
> >      long ret = 0;
> >      struct vcpu *v = current;
> > 
> > +    ASSERT(!is_pvh_vcpu(v));
> > +
> 
> For one, I don't think there has been anything so far making
> clear that this is unreachable for PVH.

hvm_do_hypercall() returns -ENOSYS for both callers of register_guest_callback, so this is unreachable for PVH. I can even remove the ASSERT if you'd like.

> And then it is inconsistent to do this here, but not also in
> unregister_guest_callback().

I can add one there too, or remove the one from register.

thanks
Mukesh
Mukesh Rathor
2013-Jun-27 03:09 UTC
Re: [PATCH 15/18] PVH xen: add hypercall support for PVH
On Tue, 25 Jun 2013 11:12:25 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> > @@ -3777,7 +3815,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
> >          return -ESRCH;
> > 
> >      rc = -EINVAL;
> > -    if ( !is_hvm_domain(d) )
> > +    if ( is_pv_domain(d) )
> >          goto param_fail;
> > 
> >      rc = xsm_hvm_param(XSM_TARGET, d, op);
> > @@ -3949,7 +3987,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
> >          break;
> >      }
> > 
> > -    if ( rc == 0 )
> > +    if ( rc == 0 && !is_pvh_domain(d) )
> >      {
> >          d->arch.hvm_domain.params[a.index] = a.value;
> > 

>> This last check I think you do because params[] points nowhere for

Correct.

>> PVH guests. If so - why don't you just drop this and the earlier
>> hunk? Or otherwise some of the case statements between need to

I don't understand, drop from where? You mean a totally separate function for PVH (I had that in very early patches)?

>> also guard against accessing the unset pointer.

Correct, I'd need to do that. Originally, I had a white list of case operations prohibited for PVH, but removed it.

-Mukesh
On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > +static void read_vmcs_selectors(struct cpu_user_regs *regs)
> > +{
> > +    regs->ss = __vmread(GUEST_SS_SELECTOR);
> > +    regs->ds = __vmread(GUEST_DS_SELECTOR);
> > +    regs->es = __vmread(GUEST_ES_SELECTOR);
> > +    regs->gs = __vmread(GUEST_GS_SELECTOR);
> > +    regs->fs = __vmread(GUEST_FS_SELECTOR);
> > +}
> 
> By only conditionally reading the selector registers, how do you
> guarantee that read_segment_register() would always read
> valid values? I think that macro needs to not look at "regs->?s"
> at all...

read_segment_register() gets called for PVH only for the EXIT_REASON_IO_INSTRUCTION intercept. In this path, we read all the selectors before calling emulate_privileged_op. If someone changes the code, they'd have to make sure of that. I can add more comments there, or go back to always reading all selectors upon vmexit, but you already made me change that.

> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> > +{
> > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
> 
> Did you perhaps mean !guest_kernel_mode()?

No, the pvh kernel has been changed to just do cpuid natively. Hopefully, over time (a looong time), emulate_forced_invalid_op can just be removed.

> > +          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
> > +          __vmread(GUEST_CR0));
> > +
> > +    /* For guest_kernel_mode which is called from most places below. */
> > +    regs->cs = __vmread(GUEST_CS_SELECTOR);
> 
> Which raises the question of whether your uses of
> guest_kernel_mode() are appropriate in the first place: Before this
> series there's no use at all under xen/arch/x86/hvm/.

HVM should do this for debug intercepts, otherwise it is wrongly intercepting user level debuggers like gdb. HVM can also use this check for emulating the forced invalid op only for user levels. Since there's a cpuid intercept, and we are trying to reduce pv-ops, this seems plausible.

thanks,
-Mukesh
Jan Beulich
2013-Jun-27 07:22 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
>>> On 27.06.13 at 00:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 10:36:41 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > @@ -1524,6 +1528,49 @@ static int read_descriptor(unsigned int sel,
>> > --- a/xen/include/asm-x86/system.h
>> > +++ b/xen/include/asm-x86/system.h
>> > @@ -4,10 +4,20 @@
>> >  #include <xen/lib.h>
>> >  #include <xen/bitops.h>
>> > 
>> > -#define read_segment_register(vcpu, regs, name)                 \
>> > -({  u16 __sel;                                                  \
>> > -    asm volatile ( "movw %%" STR(name) ",%0" : "=r" (__sel) );  \
>> > -    __sel;                                                      \
>> > +/*
>> > + * We need vcpu because during context switch, going from PVH to PV,
>> > + * in save_segments(), current has been updated to next, and no longer pointing
>> > + * to the PVH.
>> > + */
>> 
>> This is bogus - you shouldn't need any of the {save,load}_segment()
>> machinery for PVH, and hence this is not a valid reason for adding a
>> vcpu parameter here.
> 
> Ok, let's revisit this again since it's been a few months already:
> 
> read_segment_register() is called from a few places for PVH, and for
> PVH it needs to read the value from regs. So it needs to be
> modified to check for PVH. Originally, I had started with checking
> for is_pvh_vcpu(current), but that failed quickly because of the
> context switch call chain:
> 
>     __context_switch -> ctxt_switch_from -> save_segments -> read_segment_register
> 
> In this path, going from PV to PVH, the intention is to save
> segments for PV, and since current has already been updated to
> point to the PVH vcpu, the check on current is not correct. Hence the need
> for the vcpu parameter. I will enhance the comments in the macro prolog
> in the next patch version.

No. I already said that {save,load}_segments() ought to be skipped for PVH, as what it does is already done by VMRESUME/#VMEXIT. And the function is being passed a vCPU pointer, so simply making the single call to save_segments() conditional on is_pv_vcpu(), and converting the !is_hvm_vcpu() around the call to load_LDT() and load_segments() to is_pv_vcpu() (provided the LDT handling isn't needed on the same basis) should eliminate that need.

Furthermore - the reading from struct cpu_user_regs continues to be bogus (read: at least a latent bug) as long as you don't always save the selector registers, which you now validly don't do anymore. You should be consulting the VMCS instead, i.e. go through hvm_get_segment_register(). Whether a more lightweight variant reading just the selector is in order I can't immediately tell.

Jan
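A minimal sketch of what Jan describes for __context_switch(); the p/n names (outgoing/incoming vcpu) are assumptions about the surrounding code, not quoted from the patch:

    /* Segment state of a PVH vcpu is handled by VMRESUME/#VMEXIT. */
    if ( is_pv_vcpu(p) )
        save_segments(p);

    /* ... later, for the incoming vcpu ... */
    if ( is_pv_vcpu(n) )
    {
        load_LDT(n);
        load_segments(n);
    }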
Jan Beulich
2013-Jun-27 07:25 UTC
Re: [PATCH 14/18] PVH xen: Checks, asserts, and limitations for PVH
>>> On 27.06.13 at 04:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 10:54:15 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > --- a/xen/arch/x86/hvm/mtrr.c
>> > +++ b/xen/arch/x86/hvm/mtrr.c
>> > @@ -578,6 +578,9 @@ int32_t hvm_set_mem_pinned_cacheattr(
>> >  {
>> >      struct hvm_mem_pinned_cacheattr_range *range;
>> > 
>> > +    /* A PVH guest writes to MSR_IA32_CR_PAT natively. */
>> > +    ASSERT(!is_pvh_domain(d));
>> 
>> This can't be an assert, or did I overlook you preventing the
>> function to be called for PVH guests.
>> 
>> The comment would then be wrong too, as there is a path
>> leading here from a domctl (i.e. unaffected by how the guest
>> itself would access the MSR).
> 
> Well, there are no callers right now, and I wanted to catch any during
> my test runs. But now I think the ASSERT should be replaced with
> returning -ENOSYS. Let me know if you disagree.

An error return seems correct, but -ENOSYS doesn't seem the best possible error code to correctly identify the kind of error. -EOPNOTSUPP perhaps?

>> > --- a/xen/arch/x86/x86_64/traps.c
>> > +++ b/xen/arch/x86/x86_64/traps.c
>> > @@ -440,6 +440,8 @@ static long register_guest_callback(struct callback_register *reg)
>> >      long ret = 0;
>> >      struct vcpu *v = current;
>> > 
>> > +    ASSERT(!is_pvh_vcpu(v));
>> > +
>> 
>> For one, I don't think there has been anything so far making
>> clear that this is unreachable for PVH.
> 
> hvm_do_hypercall() returns -ENOSYS for both callers of
> register_guest_callback, so this is unreachable for PVH. I can even
> remove the ASSERT if you'd like.
> 
>> And then it is inconsistent to do this here, but not also in
>> unregister_guest_callback().
> 
> I can add one there too, or remove the one from register.

Removing the one above would be my preference, but the only requirement I have is that both cases should be consistent with one another.

Jan
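A minimal sketch of the error return replacing the ASSERT in hvm_set_mem_pinned_cacheattr(); the comment wording is an assumption:

    /* Pinned cache attributes are not yet supported for PVH. */
    if ( is_pvh_domain(d) )
        return -EOPNOTSUPP;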
Jan Beulich
2013-Jun-27 07:29 UTC
Re: [PATCH 15/18] PVH xen: add hypercall support for PVH
>>> On 27.06.13 at 05:09, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 11:12:25 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>> > @@ -3777,7 +3815,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
>> >          return -ESRCH;
>> > 
>> >      rc = -EINVAL;
>> > -    if ( !is_hvm_domain(d) )
>> > +    if ( is_pv_domain(d) )
>> >          goto param_fail;
>> > 
>> >      rc = xsm_hvm_param(XSM_TARGET, d, op);
>> > @@ -3949,7 +3987,7 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
>> >          break;
>> >      }
>> > 
>> > -    if ( rc == 0 )
>> > +    if ( rc == 0 && !is_pvh_domain(d) )
>> >      {
>> >          d->arch.hvm_domain.params[a.index] = a.value;
>> > 
>> 
>> This last check I think you do because params[] points nowhere for
> 
> Correct.
> 
>> PVH guests. If so - why don't you just drop this and the earlier
>> hunk? Or otherwise some of the case statements between need to
> 
> I don't understand, drop from where? You mean a totally separate function
> for PVH (I had that in very early patches)?

No, just drop the two patch hunks. The first check then allows only HVM guests into all the parameter handling code, and hence no further checks are necessary further down in any of the parameter handling.

>> also guard against accessing the unset pointer.
> 
> Correct, I'd need to do that. Originally, I had a white list of case
> operations prohibited for PVH, but removed it.

This would only be necessary if _some_ of the parameters can validly be set for PVH. In which case params[] can't be NULL anymore, so the whole logic would need to change.

Jan
>>> On 27.06.13 at 05:30, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> > +static void read_vmcs_selectors(struct cpu_user_regs *regs)
>> > +{
>> > +    regs->ss = __vmread(GUEST_SS_SELECTOR);
>> > +    regs->ds = __vmread(GUEST_DS_SELECTOR);
>> > +    regs->es = __vmread(GUEST_ES_SELECTOR);
>> > +    regs->gs = __vmread(GUEST_GS_SELECTOR);
>> > +    regs->fs = __vmread(GUEST_FS_SELECTOR);
>> > +}
>> 
>> By only conditionally reading the selector registers, how do you
>> guarantee that read_segment_register() would always read
>> valid values? I think that macro needs to not look at "regs->?s"
>> at all...
> 
> read_segment_register() gets called for PVH only for the
> EXIT_REASON_IO_INSTRUCTION intercept. In this path, we read all the
> selectors before calling emulate_privileged_op. If someone changes
> the code, they'd have to make sure of that. I can add more comments
> there, or go back to always reading all selectors upon vmexit, but you
> already made me change that.

As per my earlier reply, I think this is wrong. Both from a conceptual POV and considering that new users of read_segment_register() may appear in the future. You ought to read the VMCS field in read_segment_register() if you want to keep avoiding the saving of the selector fields (which I strongly recommend).

>> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
>> > +{
>> > +    if ( guest_kernel_mode(current, regs) || !emulate_forced_invalid_op(regs) )
>> 
>> Did you perhaps mean !guest_kernel_mode()?
> 
> No, the pvh kernel has been changed to just do cpuid natively. Hopefully,
> over time (a looong time), emulate_forced_invalid_op can just be removed.

While I don't disagree with a decision like this, the way you present it still makes me want to comment: What you do or don't do in Linux doesn't matter. What matters is a clear ABI description - what are the requirements for a PVH kernel implementation, and in particular what are the differences from a PV one? In the case here, a requirement would now be to _not_ use the PV form of CPUID (or more generally any operation that would result in #UD with the expectation that the hypervisor emulates the instruction).

I don't think I've seen such a formalized list of differences, which would make it somewhat difficult for someone else to convert their favorite OS to support PVH too.

>> > +          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
>> > +          __vmread(GUEST_CR0));
>> > +
>> > +    /* For guest_kernel_mode which is called from most places below. */
>> > +    regs->cs = __vmread(GUEST_CS_SELECTOR);
>> 
>> Which raises the question of whether your uses of
>> guest_kernel_mode() are appropriate in the first place: Before this
>> series there's no use at all under xen/arch/x86/hvm/.
> 
> HVM should do this for debug intercepts, otherwise it is wrongly
> intercepting user level debuggers like gdb.

And why would intercepting kernel debuggers like kdb or kgdb be correct?

> HVM can also use this check for emulating the forced invalid op only
> for user levels. Since there's a cpuid intercept, and we are trying to
> reduce pv-ops, this seems plausible.

HVM isn't supposed to be using PV CPUID. Ideally the same would be true for PVH (i.e. it may be better to make it look to user space like HVM, but I'm not sure if there aren't other collisions).

Jan
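A minimal sketch of reading the selector through hvm_get_segment_register(), as suggested above; the x86_seg_##name token pasting and keeping the existing macro shape are assumptions, and this leaves aside the v == current question raised in the follow-up:

    #define read_segment_register(vcpu, regs, name) ({                   \
        u16 sel_;                                                        \
        if ( is_pvh_vcpu(vcpu) )                                         \
        {                                                                \
            struct segment_register seg_;                                \
            /* Selector comes from the VMCS, not from saved state. */    \
            hvm_get_segment_register((vcpu), x86_seg_##name, &seg_);     \
            sel_ = seg_.sel;                                             \
        }                                                                \
        else                                                             \
            asm volatile ( "movw %%" STR(name) ",%0" : "=r" (sel_) );    \
        sel_;                                                            \
    })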
Mukesh Rathor
2013-Jun-27 23:43 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
On Thu, 27 Jun 2013 08:22:42 +0100
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 27.06.13 at 00:41, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> > On Tue, 25 Jun 2013 10:36:41 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >> This is bogus - you shouldn't need any of the {save,load}_segment()
> >> machinery for PVH, and hence this is not a valid reason for adding
> >> a vcpu parameter here.
> > 
> > Ok, let's revisit this again since it's been a few months already:
> > 
> > read_segment_register() is called from a few places for PVH, and for
> > PVH it needs to read the value from regs. So it needs to be
> > modified to check for PVH. Originally, I had started with checking
> > for is_pvh_vcpu(current), but that failed quickly because of the
> > context switch call chain:
> > 
> >     __context_switch -> ctxt_switch_from -> save_segments -> read_segment_register
> > 
> > In this path, going from PV to PVH, the intention is to save
> > segments for PV, and since current has already been updated to
> > point to the PVH vcpu, the check on current is not correct. Hence the
> > need for the vcpu parameter. I will enhance the comments in the macro
> > prolog in the next patch version.
> 
> No. I already said that {save,load}_segments() ought to be
> skipped for PVH, as what it does is already done by VMRESUME/
> #VMEXIT. And the function is being passed a vCPU pointer, so
> simply making the single call to save_segments() conditional on
> is_pv_vcpu(), and converting the !is_hvm_vcpu() around the
> call to load_LDT() and load_segments() to is_pv_vcpu() (provided
> the LDT handling isn't needed on the same basis) should eliminate
> that need.

They are *not* being called for PVH; where do you see that? Are you looking at the right patches? They are called for PV. Again, going from PV to PVH in the context switch, current will be pointing to the PVH vcpu, not the PV one, when save_segments calls the macro to save segments for PV (not PVH). Hence the vcpu is passed to save_segments, and we need to pass it to our famed macro above!

> Furthermore - the reading from struct cpu_user_regs continues
> to be bogus (read: at least a latent bug) as long as you don't
> always save the selector registers, which you now validly don't
> do anymore.

Right, because you made me move it to the path that calls the macro. So, for the path where the macro is called, the selectors will have been read. So, what's the latent bug?

> You should be consulting the VMCS instead, i.e. go
> through hvm_get_segment_register().
> Whether a more lightweight variant reading just the selector is
> in order I can't immediately tell.

Well, that would require v == current always, and that is not guaranteed in the macro path. What exactly is the problem the way it is?

-Mukesh
On Thu, 27 Jun 2013 08:41:34 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 27.06.13 at 05:30, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > +static void read_vmcs_selectors(struct cpu_user_regs *regs)
> >> > +{
> >> > +    regs->ss = __vmread(GUEST_SS_SELECTOR);
> >> > +    regs->ds = __vmread(GUEST_DS_SELECTOR);
> >> > +    regs->es = __vmread(GUEST_ES_SELECTOR);
> >> > +    regs->gs = __vmread(GUEST_GS_SELECTOR);
> >> > +    regs->fs = __vmread(GUEST_FS_SELECTOR);
> >> > +}
> >>
> >> By only conditionally reading the selector registers, how do you
> >> guarantee that read_segment_register() would always read
> >> valid values? I think that macro needs to not look at "regs->?s"
> >> at all...
> >
> > read_segment_register() gets called for PVH only for the
> > EXIT_REASON_IO_INSTRUCTION
> > intercept. In this path, we read all selectors before calling
> > emulate_privileged_op. If someone changes code, they'd have to make
> > sure of that. I can add more comments there, or go back to always
> > reading all selectors
> > upon vmexit, but you already made me change that.
>
> As per my earlier reply, I think this is wrong. Both from a
> conceptual POV and considering that new users of
> read_segment_register() may appear in the future. You
> ought to read the VMCS field in read_segment_register() if
> you want to keep avoiding the saving of the selector fields
> (which I strongly recommend).

I fail to see why saving the selector fields is any worse. New users of read_segment_register would have to make sure that the on-demand selector read always happens with current == the PVH vCPU in the new path. To me that is no different from checking to make sure the selectors are saved on the new call path.

> >> > +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> >> > +{
> >> > +    if ( guest_kernel_mode(current, regs)
> >> >          || !emulate_forced_invalid_op(regs) )
> >>
> >> Did you perhaps mean !guest_kernel_mode()?
> >
> > No, the pvh kernel has been changed to just do cpuid natively.
> > Hopefully, over time, a looong time, emulate_forced_invalid_op can
> > just be removed.
>
> While I don't disagree with a decision like this, the way you present
> it still makes me want to comment: What you do or don't do in
> Linux doesn't matter. What matters is a clear ABI description - what
> are the requirements on a PVH kernel implementation, and in
> particular what are the differences from a PV one? In the case here, a
> requirement would now be to _not_ use the PV form of CPUID (or
> more generally any operation that would result in #UD with the
> expectation that the hypervisor emulates the instruction).
>
> I don't think I've seen such a formalized list of differences, which
> would make it somewhat difficult for someone else to convert their
> favorite OS to support PVH too.

Correct, I don't think we are at a point where we can create such a list. It's been almost six months that the patches have been out, and right now I'd like to focus on getting them in.

> >> > +         ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
> >> > +         __vmread(GUEST_CR0));
> >> > +
> >> > +    /* For guest_kernel_mode which is called from most places
> >> > below.
*/
> >> > +    regs->cs = __vmread(GUEST_CS_SELECTOR);
> >>
> >> Which raises the question of whether your uses of
> >> guest_kernel_mode() are appropriate in the first place: Before this
> >> series there's no use at all under xen/arch/x86/hvm/.
> >
> > HVM should do this for debug intercepts, otherwise it is wrongly
> > intercepting user level debuggers like gdb.
>
> And why would intercepting kernel debuggers like kdb or kgdb be
> correct?

The last I checked there were no kernel debuggers supported in domU, so I wrote gdbsx. If that has changed, then there's some work to do for gdbsx, for both hvm and pvh.

> > HVM can also use this check for emulating forced invalid op for only
> > user levels. Since there's a cpuid intercept, and we are trying to
> > reduce pv-ops, this seems plausible.
>
> HVM isn't supposed to be using PV CPUID. Ideally the same would
> be true for PVH (i.e. it may be better to make it look to user space
> like HVM, but I'm not sure if there aren't other collisions).

Correct, PVH does not use PV CPUID in the PVH kernel. Initially, it was not supported at all, but you convinced me to support it from user level.

thanks
Mukesh
On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > --- /dev/null
........
> > +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs)
> > +{
> > +    unsigned long exit_qualification;
> > +    unsigned int exit_reason = __vmread(VM_EXIT_REASON);
> > +    int rc = 0, ccpu = smp_processor_id();
> > +    struct vcpu *v = current;
> > +
> > +    dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx
> > EFLAGS:%lx CR0:%lx\n",
> > +          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
> > +          __vmread(GUEST_CR0));
> > +
> > +    /* For guest_kernel_mode which is called from most places
> > below. */
> > +    regs->cs = __vmread(GUEST_CS_SELECTOR);
>
> Which raises the question of whether your uses of
> guest_kernel_mode() are appropriate in the first place: Before this
> series there's no use at all under xen/arch/x86/hvm/.
>
> And if it is, I'd like to point out once again that this check should
> be looking at SS.DPL, not CS.RPL.

Are you suggesting changing the macro to check SS.DPL instead of the CS.RPL check it has always done for PV as well? Note, PVH has checks in this patch to enforce long mode execution always, so CS.RPL should always be valid for PVH.

Mukesh
On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
.......
> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
> > vcpu_guest_context *ctxtp)
> > +{
> > +    if ( v->vcpu_id == 0 )
> > +        return 0;
> > +
> > +    vmx_vmcs_enter(v);
> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> > +
> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
>
> How does this work without also writing the "hidden" register
> fields?

This is for bringing up SMP CPUs by the guest, which has already set the GDT up, so it just needs selectors to be loaded to start the target vcpu.

> > +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> > +    {
> > +        vmx_vmcs_exit(v);
> > +        return -EINVAL;
> > +    }
> > +    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel);
>
> So you write both GS bases, but not the base of FS (and above
> its selector is being skipped too)?

Right, for 32bit PVH we'd need to do that. It needs a PVH 32bitfixme tag. Or I can just do it for both; for 64bit it would be null anyway.

> And there are other parts of struct vcpu_guest_context that
> I don't see getting mirrored - are all of them getting handled
> elsewhere?

The call comes from VCPUOP_initialise -> arch_set_info_guest() which handles some of the other fields. There's a lot less to load for PVH compared to PV.

thanks
mukesh
Jan Beulich
2013-Jun-28 09:20 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
>>> On 28.06.13 at 01:43, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 27 Jun 2013 08:22:42 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 27.06.13 at 00:41, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Tue, 25 Jun 2013 10:36:41 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >
>> >> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> > + * in save_segments(), current has been updated to next, and no
>> >> > longer pointing
>> >> > + * to the PVH.
>> >> > + */
>> >>
>> >> This is bogus - you shouldn't need any of the {save,load}_segment()
>> >> machinery for PVH, and hence this is not a valid reason for adding
>> >> a vcpu parameter here.
>> >
>> > Ok, let's revisit this again since it's been a few months already:
>> >
>> > read_segment_register() is called from a few places for PVH, and for
>> > PVH it needs to read the value from regs. So it needs to be
>> > modified to check for PVH. Originally, I had started with checking
>> > for is_pvh_vcpu(current), but that failed quickly because of the
>> > context switch call chain:
>> >
>> > __context_switch -> ctxt_switch_from --> save_segments ->
>> > read_segment_register
>> >
>> > In this path, going from PV to PVH, the intention is to save
>> > segments for PV, and since current has already been updated to
>> > point to PVH, the check for current is not correct. Hence, the need
>> > for the vcpu parameter. I will enhance my comments in the macro prolog
>> > in the next patch version.
>>
>> No. I already said that {save,load}_segments() ought to be
>> skipped for PVH, as what it does is already done by VMRESUME/
>> #VMEXIT. And the function is being passed a vCPU pointer, so
>> simply making the single call to save_segments() conditional on
>> is_pv_vcpu(), and converting the !is_hvm_vcpu() around the
>> call to load_LDT() and load_segments() to is_pv_vcpu() (provided
>> the LDT handling isn't needed on the same basis) should eliminate
>> that need.
>
> They are *not* being called for PVH, where do you see that? Are you
> looking at the right patches? They are called for PV. Again, going from
> PV to PVH in a context switch, current will be pointing to the PVH vCPU
> and not the PV one when save_segments calls the macro to save segments
> for PV (not PVH).
> Hence, the vCPU is passed to save_segments, and we need to pass it to our
> famed macro above!

But I'm not arguing that the vCPU pointer shouldn't be passed to this - I'm trying to tell you that having this macro read the selector values from struct cpu_user_regs in the PVH case is wrong. It was you continuing to point at the context switch path that made me believe that so far you don't properly suppress the uses of {save,load}_segments() for PVH.

>> Furthermore - the reading from struct cpu_user_regs continues
>> to be bogus (read: at least a latent bug) as long as you don't
>> always save the selector registers, which you now validly don't
>> do anymore.
>
> Right, because you made me move it to the path that calls the
> macro. So, for the path where the macro is called, the selectors
> will have been read. So, what's the latent bug?

The problem is that you think that now and forever this macro will only be used from the MMIO emulation path (or some such, in any case - just from one very specific path). This is an assumption you may make while in an early development phase, but not in patches that you intend to be committed: Someone adding another use of the macro is _very_ unlikely to go and check what constraints apply to that use.
The macro has to work in the general case.

>> You should be consulting the VMCS instead, i.e. go
>> through hvm_get_segment_register().
>> Whether a more lightweight variant reading just the selector is
>> in order I can't immediately tell.
>
> Well, that would require v == current always, and that is not guaranteed
> in the macro path. What exactly is the problem the way it is?

I think you need to view this slightly differently: At present, the macro reads the live register values. Which means even if v != current, we're still in a state where the hardware has the correct values. This ought to apply in exactly the same way to PVH - the current VMCS should still be holding the right values.

Furthermore, the case is relatively easy to determine: Instead of only looking at current, you could also take per_cpu(curr_vcpu, ) into account.

And finally - vmx_get_segment_register() already takes a vCPU pointer, and uses vmx_vmcs_{enter,exit}(), so there's no requirement for that function to run in the context of the subject vCPU. vmx_vmcs_enter() itself checks whether it's running on the subject vCPU though, so that may need inspection/tweaking if the context switch path would really ever get you into that macro (I doubt that it will though, and making assumptions about the context switch path [not] doing certain things _is_ valid, as opposed to making such assumptions in arbitrary code).

Jan
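[A sketch of the "is the subject vCPU's state still live on this CPU?" test Jan describes, purely illustrative: per_cpu(curr_vcpu, ...) is file-local to domain.c in the Xen of this era, so a real helper would need access to it arranged differently.]

    /* Live-state test: regs/hardware may only be trusted for 'current'
     * or the lazily-switched per_cpu curr_vcpu; otherwise consult the
     * VMCS via vmx_vmcs_enter()/vmx_vmcs_exit(). */
    static bool_t vcpu_regs_are_live(const struct vcpu *v)
    {
        return (v == current) ||
               (v == per_cpu(curr_vcpu, smp_processor_id()));
    }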
>>> On 28.06.13 at 03:28, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Thu, 27 Jun 2013 08:41:34 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 27.06.13 at 05:30, Mukesh Rathor <mukesh.rathor@oracle.com>
>> > No, the pvh kernel has been changed to just do cpuid natively.
>> > Hopefully, over time, a looong time, emulate_forced_invalid_op can
>> > just be removed.
>>
>> While I don't disagree with a decision like this, the way you present
>> it still makes me want to comment: What you do or don't do in
>> Linux doesn't matter. What matters is a clear ABI description - what
>> are the requirements on a PVH kernel implementation, and in
>> particular what are the differences from a PV one? In the case here, a
>> requirement would now be to _not_ use the PV form of CPUID (or
>> more generally any operation that would result in #UD with the
>> expectation that the hypervisor emulates the instruction).
>>
>> I don't think I've seen such a formalized list of differences, which
>> would make it somewhat difficult for someone else to convert their
>> favorite OS to support PVH too.
>
> Correct, I don't think we are at a point where we can create such a
> list. It's been almost six months that the patches have been out, and
> right now I'd like to focus on getting them in.

So in the end this means you're making and revisiting decisions on what a PVH guest is or is not permitted to do as you go, with no clear model spelled out up front. That's rather worrying to me.

>> > HVM should do this for debug intercepts, otherwise it is wrongly
>> > intercepting user level debuggers like gdb.
>>
>> And why would intercepting kernel debuggers like kdb or kgdb be
>> correct?
>
> The last I checked there were no kernel debuggers supported in domU, so
> I wrote gdbsx. If that has changed, then there's some work to do for
> gdbsx, for both hvm and pvh.

In a HVM guest, a kernel debugger should work, or else the emulation is incomplete (which then is a plain bug).

Jan
>>> On 28.06.13 at 03:35, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 11:49:57 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > --- /dev/null
> ........
>> > +void vmx_pvh_vmexit_handler(struct cpu_user_regs *regs)
>> > +{
>> > +    unsigned long exit_qualification;
>> > +    unsigned int exit_reason = __vmread(VM_EXIT_REASON);
>> > +    int rc = 0, ccpu = smp_processor_id();
>> > +    struct vcpu *v = current;
>> > +
>> > +    dbgp1("PVH:[%d]left VMCS exitreas:%d RIP:%lx RSP:%lx
>> > EFLAGS:%lx CR0:%lx\n",
>> > +          ccpu, exit_reason, regs->rip, regs->rsp, regs->rflags,
>> > +          __vmread(GUEST_CR0));
>> > +
>> > +    /* For guest_kernel_mode which is called from most places
>> > below. */
>> > +    regs->cs = __vmread(GUEST_CS_SELECTOR);
>>
>> Which raises the question of whether your uses of
>> guest_kernel_mode() are appropriate in the first place: Before this
>> series there's no use at all under xen/arch/x86/hvm/.
>>
>> And if it is, I'd like to point out once again that this check should
>> be looking at SS.DPL, not CS.RPL.
>
> Are you suggesting changing the macro to check SS.DPL instead of
> the CS.RPL check it has always done for PV as well? Note, PVH has
> checks in this patch to enforce long mode execution always, so CS.RPL
> should always be valid for PVH.

I'm saying that guest_kernel_mode() should be looking at the VMCS for PVH (and, should it happen to be used in HVM code paths, for HVM too) rather than struct cpu_user_regs. That makes the saving of the CS selector pointless (in line with how HVM behaves), and once you're going through hvm_get_segment_register(), you can as well do this properly (i.e. look at SS.DPL rather than CS.RPL). And no, repeatedly comparing segment register handling with PV is bogus: In the PV case we just don't have the luxury of accessible hidden register portions, i.e. we need to get away with looking at selectors only. Once you introduce this sort of hybrid model, you should avoid _any_ unnecessary relaxations.

Jan
>>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
>> > vcpu_guest_context *ctxtp)
>> > +{
>> > +    if ( v->vcpu_id == 0 )
>> > +        return 0;
>> > +
>> > +    vmx_vmcs_enter(v);
>> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
>> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
>> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
>> > +
>> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
>> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
>> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
>> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
>> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
>>
>> How does this work without also writing the "hidden" register
>> fields?
>
> This is for bringing up SMP CPUs by the guest, which has already
> set the GDT up, so it just needs selectors to be loaded to start the
> target vcpu.

That makes no sense to me: Once you VMLAUNCH that vCPU, it'll get the hidden register fields loaded from the VMCS, without accessing the GDT. If that understanding of mine is wrong, please explain how you see things working in more detail.

>> > +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
>> > +    {
>> > +        vmx_vmcs_exit(v);
>> > +        return -EINVAL;
>> > +    }
>> > +    vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel);
>>
>> So you write both GS bases, but not the base of FS (and above
>> its selector is being skipped too)?
>
> Right, for 32bit PVH we'd need to do that. It needs a PVH 32bitfixme tag.
> Or I can just do it for both; for 64bit it would be null anyway.

This again is a Linux assumption. Please stop building Linux-isms into the hypervisor.

>> And there are other parts of struct vcpu_guest_context that
>> I don't see getting mirrored - are all of them getting handled
>> elsewhere?
>
> The call comes from VCPUOP_initialise -> arch_set_info_guest() which
> handles some of the other fields. There's a lot less to load for PVH
> compared to PV.

So just as an example - where would non-zero ldt_base/ldt_ents get dealt with? Just to repeat the above - don't build in assumptions about which fields may be unused by your Linux patch set (or add fixme notes identifying those places).

And btw., looking at that patch again I'm also getting the impression that the GS base handling in that function is lacking consideration of VGCF_in_kernel.

Jan
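[To make Jan's point about the hidden fields concrete: VMLAUNCH loads base/limit/attributes straight from the VMCS and never walks the GDT, so writing selectors alone leaves stale hidden state. A hypothetical continuation of vmx_pvh_set_vcpu_info() for CS only, assuming a flat 64-bit code segment; 0xa09b encodes type=0xb, S=1, P=1, L=1, G=1 per the SDM:]

    /* Sketch: keep the hidden CS fields consistent with the selector
     * just written, since no GDT lookup happens on VM entry. */
    __vmwrite(GUEST_CS_BASE, 0);
    __vmwrite(GUEST_CS_LIMIT, ~0u);
    __vmwrite(GUEST_CS_AR_BYTES, 0xa09b);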
On Fri, 28 Jun 2013 10:31:53 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 28.06.13 at 03:35, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > --- /dev/null
........
> >> Which raises the question of whether your uses of
> >> guest_kernel_mode() are appropriate in the first place: Before this
> >> series there's no use at all under xen/arch/x86/hvm/.
> >>
> >> And if it is, I'd like to point out once again that this check
> >> should be looking at SS.DPL, not CS.RPL.
> >
> > Are you suggesting changing the macro to check SS.DPL instead of
> > the CS.RPL check it has always done for PV as well? Note, PVH has
> > checks in this patch to enforce long mode execution always, so CS.RPL
> > should always be valid for PVH.
>
> I'm saying that guest_kernel_mode() should be looking at the
> VMCS for PVH (and, should it happen to be used in HVM code
> paths, for HVM too) rather than struct cpu_user_regs. That
> makes the saving of the CS selector pointless (in line with how
> HVM behaves), and once you're going through
> hvm_get_segment_register(), you can as well do this properly
> (i.e. look at SS.DPL rather than CS.RPL). And no, repeatedly
> comparing segment register handling with PV is bogus: In the PV
> case we just don't have the luxury of accessible hidden register
> portions, i.e. we need to get away with looking at selectors only.

Just for my knowledge, why can't we read the GDT entry in the PV case to get the hidden fields, since we have access to both the GDT base and the selector?

thanks
Mukesh
On Fri, 28 Jun 2013 10:44:08 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
> >> > vcpu_guest_context *ctxtp)
> >> > +{
> >> > +    if ( v->vcpu_id == 0 )
> >> > +        return 0;
> >> > +
> >> > +    vmx_vmcs_enter(v);
> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> >> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> >> > +
> >> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> >> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> >> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> >> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> >> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> >>
> >> How does this work without also writing the "hidden" register
> >> fields?
> >
> > This is for bringing up SMP CPUs by the guest, which has already
> > set the GDT up, so it just needs selectors to be loaded to start the
> > target vcpu.
>
> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
> get the hidden register fields loaded from the VMCS, without
> accessing the GDT. If that understanding of mine is wrong, please
> explain how you see things working in more detail.

I see things the same as you do, that it'll get the hidden fields from the vmcs. The dilemma here is what to do about the VCPUOP_initialise hcall. I am currently checking to see if the new vcpu can just set the context itself first thing, without the VCPUOP_initialise hcall completely.

Correct, the guest I am dealing with is Linux. And correct again, that these are Linux focussed. I believe it was someone working on BSD who had inquired about PVH, and looking at my patches, and probably would be working with us when they are ready.

These patches are to allow a 64bit Linux PV domU to boot in PVH mode on xen. Since no new hypercalls are created, nor are there any major changes to any, it doesn't seem to me a huge deal that they are Linux focussed right now. To me, it's a step in the direction of someday having PVH support for all OSs.

Mukesh
>>> On 29.06.13 at 05:03, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Just for my knowledge, why can't we read the GDT entry in the PV case
> to get the hidden fields, since we have access to both the GDT base
> and the selector?

Because what's in the registers may not be what's in the GDT. And of course it would be quite a bit more expensive (as opposed to the PVH case where you can just pick which of the VMCS fields you want to read - there shouldn't be a performance difference).

Jan
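[An illustration of "what's in the registers may not be what's in the GDT": the hidden part of a segment register is a snapshot taken at selector-load time. Guest-side sketch, all names hypothetical:]

    /* Sketch: hidden state diverging from the in-memory GDT. */
    set_gdt_entry(gdt, sel >> 3, base1, limit1);
    load_gs(sel);                 /* CPU caches base1/limit1 internally */
    set_gdt_entry(gdt, sel >> 3, base2, limit2);
    /* The live, hidden GS base/limit still hold base1/limit1; a GDT
     * walk in Xen would now reconstruct base2/limit2 - values the
     * guest is not actually running with. */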
>>> On 29.06.13 at 05:04, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 28 Jun 2013 10:44:08 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
>> > <JBeulich@suse.com> wrote:
>> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
>> >> > vcpu_guest_context *ctxtp)
>> >> > +{
>> >> > +    if ( v->vcpu_id == 0 )
>> >> > +        return 0;
>> >> > +
>> >> > +    vmx_vmcs_enter(v);
>> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
>> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
>> >> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
>> >> > +
>> >> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
>> >> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
>> >> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
>> >> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
>> >> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
>> >>
>> >> How does this work without also writing the "hidden" register
>> >> fields?
>> >
>> > This is for bringing up SMP CPUs by the guest, which has already
>> > set the GDT up, so it just needs selectors to be loaded to start the
>> > target vcpu.
>>
>> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
>> get the hidden register fields loaded from the VMCS, without
>> accessing the GDT. If that understanding of mine is wrong, please
>> explain how you see things working in more detail.
>
> I see things the same as you do, that it'll get the hidden fields from
> the vmcs. The dilemma here is what to do about the VCPUOP_initialise
> hcall. I am currently checking to see if the new vcpu can just set
> the context itself first thing, without the VCPUOP_initialise hcall
> completely.

I don't follow: What's the dilemma here? Why can't you just put _all_ of the values specified by VCPUOP_initialise into the VMCS?

> Correct, the guest I am dealing with is Linux. And correct again, that
> these are Linux focussed. I believe it was someone working on BSD who
> had inquired about PVH, and looking at my patches, and probably would be
> working with us when they are ready.
>
> These patches are to allow a 64bit Linux PV domU to boot in PVH mode on xen.
> Since no new hypercalls are created, nor are there any major changes to any,
> it doesn't seem to me a huge deal that they are Linux focussed right now.
> To me, it's a step in the direction of someday having PVH support for
> all OSs.

As said before - I would consider this sort of acceptable if the omissions were spelled out clearly (i.e. easily grep-able for and easily recognizable when reading the code).

Jan
On Mon, 01 Jul 2013 09:54:30 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 29.06.13 at 05:04, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 28 Jun 2013 10:44:08 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> >> > <JBeulich@suse.com> wrote:
> >> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
> >> >> > vcpu_guest_context *ctxtp)
> >> >> > +{
> >> >> > +    if ( v->vcpu_id == 0 )
> >> >> > +        return 0;
> >> >> > +
> >> >> > +    vmx_vmcs_enter(v);
> >> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> >> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> >> >> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> >> >> > +
> >> >> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> >> >> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> >> >> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> >> >> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> >> >> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> >> >>
> >> >> How does this work without also writing the "hidden" register
> >> >> fields?
> >> >
> >> > This is for bringing up SMP CPUs by the guest, which has already
> >> > set the GDT up, so it just needs selectors to be loaded to start
> >> > the target vcpu.
> >>
> >> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
> >> get the hidden register fields loaded from the VMCS, without
> >> accessing the GDT. If that understanding of mine is wrong, please
> >> explain how you see things working in more detail.
> >
> > I see things the same as you do, that it'll get the hidden fields
> > from the vmcs. The dilemma here is what to do about the
> > VCPUOP_initialise hcall. I am currently checking to see if the new
> > vcpu can just set the context itself first thing, without the
> > VCPUOP_initialise hcall completely.
>
> I don't follow: What's the dilemma here? Why can't you just put
> _all_ of the values specified by VCPUOP_initialise into the VMCS?

Well, OK, whatever is relevant for PVH, like ldt. Other things like trap_ctxt, *callback*... are not applicable to PVH.

> > Correct, the guest I am dealing with is Linux. And correct again,
> > that these are Linux focussed. I believe it was someone working on
> > BSD who had inquired about PVH, and looking at my patches, and
> > probably would be working with us when they are ready.
> >
> > These patches are to allow a 64bit Linux PV domU to boot in PVH
> > mode on xen. Since no new hypercalls are created, nor are there any
> > major changes to any, it doesn't seem to me a huge deal that they
> > are Linux focussed right now. To me, it's a step in the direction
> > of someday having PVH support for all OSs.
>
> As said before - I would consider this sort of acceptable if the
> omissions were spelled out clearly (i.e. easily grep-able for and
> easily recognizable when reading the code).

Ok, I'll expand the function prolog.

thanks
mukesh
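[One shape the "whatever is relevant for PVH, like ldt" handling might take inside vmx_pvh_set_vcpu_info(); the GUEST_LDTR_* VMCS fields are real, while the policy sketched here (ldt_ents counted in 8-byte descriptors, as for PV) is this editor's assumption, not code from the series:]

    /* Sketch: mirror non-zero ldt_base/ldt_ents into the VMCS so a
     * VCPUOP_initialise context with an LDT is not silently ignored. */
    if ( ctxtp->ldt_ents )
    {
        __vmwrite(GUEST_LDTR_BASE, ctxtp->ldt_base);
        __vmwrite(GUEST_LDTR_LIMIT, ctxtp->ldt_ents * 8 - 1);
        __vmwrite(GUEST_LDTR_AR_BYTES, 0x82);   /* present, type 2 = LDT */
    }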
Mukesh Rathor
2013-Jul-03 01:38 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
On Fri, 28 Jun 2013 10:20:47 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 28.06.13 at 01:43, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Thu, 27 Jun 2013 08:22:42 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 27.06.13 at 00:41, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > On Tue, 25 Jun 2013 10:36:41 +0100
> >> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >> >
> > values from struct cpu_user_regs in the PVH case is wrong. It was
> you continuing to point at the context switch path that made me
> believe that so far you don't properly suppress the uses of
> {save,load}_segments() for PVH.
>
> >> Furthermore - the reading from struct cpu_user_regs continues
> >> to be bogus (read: at least a latent bug) as long as you don't
> >> always save the selector registers, which you now validly don't
> >> do anymore.
> >
> > Right, because you made me move it to the path that calls the
> > macro. So, for the path where the macro is called, the selectors
> > will have been read. So, what's the latent bug?
>
> The problem is that you think that now and forever this macro
> will only be used from the MMIO emulation path (or some such, in
> any case - just from one very specific path). This is an assumption
> you may make while in an early development phase, but not in
> patches that you intend to be committed: Someone adding another
> use of the macro is _very_ unlikely to go and check what constraints
> apply to that use. The macro has to work in the general case.

Hmm.. Ok, I still fail to see the difference; caching upfront always is such a low overhead. Anyways, I can make the change, but do realize that the name parameter will need to change to 'enum x86_segment', and so all callers will need to change too. The macro will now need to have a switch statement inside for the non-pvh case... I may as well change it from a macro to an inlined function. Hope all that sounds ok.

mukesh
On Fri, 28 Jun 2013 10:44:08 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
......
> And btw., looking at that patch again I'm also getting the
> impression that the GS base handling in that function is lacking
> consideration of VGCF_in_kernel.

I still fail to see what VGCF_in_kernel has to do with GS base for a PVH guest. The flag should be irrelevant for PVH IMO. Can you kindly elaborate a bit?

thanks
mukesh
Jan Beulich
2013-Jul-03 10:21 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
>>> On 03.07.13 at 03:38, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 28 Jun 2013 10:20:47 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 28.06.13 at 01:43, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Thu, 27 Jun 2013 08:22:42 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >
>> >> >>> On 27.06.13 at 00:41, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> >>> wrote:
>> >> > On Tue, 25 Jun 2013 10:36:41 +0100
>> >> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >> >
>> values from struct cpu_user_regs in the PVH case is wrong. It was
>> you continuing to point at the context switch path that made me
>> believe that so far you don't properly suppress the uses of
>> {save,load}_segments() for PVH.
>>
>> >> Furthermore - the reading from struct cpu_user_regs continues
>> >> to be bogus (read: at least a latent bug) as long as you don't
>> >> always save the selector registers, which you now validly don't
>> >> do anymore.
>> >
>> > Right, because you made me move it to the path that calls the
>> > macro. So, for the path where the macro is called, the selectors
>> > will have been read. So, what's the latent bug?
>>
>> The problem is that you think that now and forever this macro
>> will only be used from the MMIO emulation path (or some such, in
>> any case - just from one very specific path). This is an assumption
>> you may make while in an early development phase, but not in
>> patches that you intend to be committed: Someone adding another
>> use of the macro is _very_ unlikely to go and check what constraints
>> apply to that use. The macro has to work in the general case.
>
> Hmm.. Ok, I still fail to see the difference; caching upfront always is
> such a low overhead.

Even if it really is (which I doubt), you still would make PVH different from both PV and HVM, which both don't populate the selector fields of the frame (PV obviously has ->cs and ->ss populated [by the CPU], but HVM avoids even that).

> Anyways, I can make the change, but do realize that
> the name parameter will need to change to 'enum x86_segment', and so all
> callers will need to change too. The macro will now need to have a
> switch statement inside for the non-pvh case... I may as well change it
> from a macro to an inlined function. Hope all that sounds ok.

We'll have to see - at the first glance I don't follow...

Jan
>>> On 03.07.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 28 Jun 2013 10:44:08 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
>> > <JBeulich@suse.com> wrote:
> ......
>> And btw., looking at that patch again I'm also getting the
>> impression that the GS base handling in that function is lacking
>> consideration of VGCF_in_kernel.
>
> I still fail to see what VGCF_in_kernel has to do with GS base for
> a PVH guest. The flag should be irrelevant for PVH IMO. Can you kindly
> elaborate a bit?

VGCF_in_kernel specifies whether a guest wants to start its vCPU in user or kernel mode (why the interface permits that is another question, but you have to play by what is there). The CPU notion of the two GS bases is "active" and "shadow", with no meaning associated with whether one is the kernel's and the other is for user mode. The Xen notion otoh is "gs_base_kernel" and "gs_base_user". Which VMCS field needs to be populated with which of the two struct vcpu_guest_context values thus depends on said flag.

Jan
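[The pairing Jan spells out, sketched for vmx_pvh_set_vcpu_info(); __vmwrite(), vmx_write_guest_msr() and the field names all appear in the patches quoted above, while the exact if/else placement here is this editor's illustration:]

    /* Sketch: which context GS base becomes "active" (GUEST_GS_BASE)
     * and which becomes the "shadow" depends on the start mode. */
    if ( ctxtp->flags & VGCF_in_kernel )
    {
        __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_kernel);    /* active */
        vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_user);
    }
    else
    {
        __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);      /* active */
        vmx_write_guest_msr(MSR_SHADOW_GS_BASE, ctxtp->gs_base_kernel);
    }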
Mukesh Rathor
2013-Jul-04 02:00 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
On Wed, 03 Jul 2013 11:21:20 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 03.07.13 at 03:38, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 28 Jun 2013 10:20:47 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 28.06.13 at 01:43, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
.............
> >> The problem is that you think that now and forever this macro
> >> will only be used from the MMIO emulation path (or some such, in
> >> any case - just from one very specific path). This is an assumption
> >> you may make while in an early development phase, but not in
> >> patches that you intend to be committed: Someone adding another
> >> use of the macro is _very_ unlikely to go and check what constraints
> >> apply to that use. The macro has to work in the general case.
> >
> > Hmm.. Ok, I still fail to see the difference; caching upfront
> > always is such a low overhead.
>
> Even if it really is (which I doubt), you still would make PVH
> different from both PV and HVM, which both don't populate the
> selector fields of the frame (PV obviously has ->cs and ->ss
> populated [by the CPU], but HVM avoids even that).

And what's wrong with PVH being a little different? If anything, that makes the code easier than making it like PV/HVM IMHO. HVM may not cache selectors, but caches other things like CR3 upfront:

vmx_vmexit_handler():
    v->arch.hvm_vcpu.guest_cr[3] = v->arch.hvm_vcpu.hw_cr[3] = __vmread(GUEST_CR3);

> We'll have to see - at the first glance I don't follow...

Here's what I am talking about:

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 16e25e9..ab1953f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1241,10 +1241,10 @@ static void save_segments(struct vcpu *v)
     struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned int dirty_segment_mask = 0;
 
-    regs->ds = read_segment_register(v, regs, ds);
-    regs->es = read_segment_register(v, regs, es);
-    regs->fs = read_segment_register(v, regs, fs);
-    regs->gs = read_segment_register(v, regs, gs);
+    regs->ds = read_segment_register(v, regs, x86_seg_ds);
+    regs->es = read_segment_register(v, regs, x86_seg_es);
+    regs->fs = read_segment_register(v, regs, x86_seg_fs);
+    regs->gs = read_segment_register(v, regs, x86_seg_gs);
 
     if ( regs->ds )
         dirty_segment_mask |= DIRTY_DS;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e0d84af..5318126 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -676,6 +676,45 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
     .fields = { .type = 0xb, .s = 0, .dpl = 0, .p = 1, .avl = 0,    \
                 .l = 0, .db = 0, .g = 0, .pad = 0 } }).bytes)
 
+u16 vmx_get_selector(struct vcpu *v, enum x86_segment seg)
+{
+    u16 sel = 0;
+
+    vmx_vmcs_enter(v);
+    switch ( seg )
+    {
+    case x86_seg_cs:
+        sel = __vmread(GUEST_CS_SELECTOR);
+        break;
+
+    case x86_seg_ss:
+        sel = __vmread(GUEST_SS_SELECTOR);
+        break;
+
+    case x86_seg_es:
+        sel = __vmread(GUEST_ES_SELECTOR);
+        break;
+
+    case x86_seg_ds:
+        sel = __vmread(GUEST_DS_SELECTOR);
+        break;
+
+    case x86_seg_fs:
+        sel = __vmread(GUEST_FS_SELECTOR);
+        break;
+
+    case x86_seg_gs:
+        sel = __vmread(GUEST_GS_SELECTOR);
+        break;
+
+    default:
+        BUG();
+    }
+    vmx_vmcs_exit(v);
+
+    return sel;
+}
+
 void vmx_get_segment_register(struct vcpu *v, enum x86_segment seg,
                               struct segment_register *reg)
 {
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index b638a6e..6b6989a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -112,6 +112,52 @@ boolean_param("ler",
opt_ler); #define stack_words_per_line 4 #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)regs->rsp) +/* + * We need vcpu because during context switch, going from pure PV to PVH, + * in save_segments(), current has been updated to next, and no longer pointing + * to the pure PV. + */ +u16 read_segment_register(struct vcpu *vcpu, + struct cpu_user_regs *regs, enum x86_segment seg) +{ + u16 sel = 0; + + if ( is_pvh_vcpu(vcpu) && guest_mode(regs) ) + sel = pvh_get_selector(vcpu, seg); + else + { + switch ( seg ) + { + case x86_seg_cs: + asm volatile ( "movw %%cs,%0" : "=r" (sel) ); + break; + + case x86_seg_ss: + asm volatile ( "movw %%ss,%0" : "=r" (sel) ); + break; + + case x86_seg_es: + asm volatile ( "movw %%es,%0" : "=r" (sel) ); + break; + + case x86_seg_ds: + asm volatile ( "movw %%ds,%0" : "=r" (sel) ); + break; + + case x86_seg_fs: + asm volatile ( "movw %%fs,%0" : "=r" (sel) ); + break; + + case x86_seg_gs: + asm volatile ( "movw %%gs,%0" : "=r" (sel) ); + break; + + default: + BUG(); + } + } + return sel; +} static void show_guest_stack(struct vcpu *v, struct cpu_user_regs *regs) { int i; @@ -1940,7 +1986,7 @@ int emulate_privileged_op(struct cpu_user_regs *regs) goto fail; /* emulating only opcodes not allowing SS to be default */ - data_sel = read_segment_register(v, regs, ds); + data_sel = read_segment_register(v, regs, x86_seg_ds); which_sel = x86_seg_ds; /* Legacy prefixes. */ @@ -1960,20 +2006,20 @@ int emulate_privileged_op(struct cpu_user_regs *regs) which_sel = x86_seg_cs; continue; case 0x3e: /* DS override */ - data_sel = read_segment_register(v, regs, ds); + data_sel = read_segment_register(v, regs, x86_seg_ds); which_sel = x86_seg_ds; continue; case 0x26: /* ES override */ - data_sel = read_segment_register(v, regs, es); + data_sel = read_segment_register(v, regs, x86_seg_es); which_sel = x86_seg_es; continue; case 0x64: /* FS override */ - data_sel = read_segment_register(v, regs, fs); + data_sel = read_segment_register(v, regs, x86_seg_fs); which_sel = x86_seg_fs; lm_ovr = lm_seg_fs; continue; case 0x65: /* GS override */ - data_sel = read_segment_register(v, regs, gs); + data_sel = read_segment_register(v, regs, x86_seg_gs); which_sel = x86_seg_gs; lm_ovr = lm_seg_gs; continue; @@ -2022,7 +2068,7 @@ int emulate_privileged_op(struct cpu_user_regs *regs) if ( !(opcode & 2) ) { - data_sel = read_segment_register(v, regs, es); + data_sel = read_segment_register(v, regs, x86_seg_es); which_sel = x86_seg_es; lm_ovr = lm_seg_none; } @@ -2769,22 +2815,22 @@ static void emulate_gate_op(struct cpu_user_regs *regs) ASSERT(opnd_sel); continue; case 0x3e: /* DS override */ - opnd_sel = read_segment_register(v, regs, ds); + opnd_sel = read_segment_register(v, regs, x86_seg_ds); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x26: /* ES override */ - opnd_sel = read_segment_register(v, regs, es); + opnd_sel = read_segment_register(v, regs, x86_seg_es); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x64: /* FS override */ - opnd_sel = read_segment_register(v, regs, fs); + opnd_sel = read_segment_register(v, regs, x86_seg_fs); if ( !opnd_sel ) opnd_sel = dpl; continue; case 0x65: /* GS override */ - opnd_sel = read_segment_register(v, regs, gs); + opnd_sel = read_segment_register(v, regs, x86_seg_gs); if ( !opnd_sel ) opnd_sel = dpl; continue; @@ -2837,7 +2883,8 @@ static void emulate_gate_op(struct cpu_user_regs *regs) switch ( modrm & 7 ) { default: - opnd_sel = read_segment_register(v, regs, ds); + opnd_sel = read_segment_register(v, regs, + x86_seg_ds); break; case 4: 
case 5: opnd_sel = regs->ss; @@ -2865,7 +2912,8 @@ static void emulate_gate_op(struct cpu_user_regs *regs) break; } if ( !opnd_sel ) - opnd_sel = read_segment_register(v, regs, ds); + opnd_sel = read_segment_register(v, regs, + x86_seg_ds); switch ( modrm & 7 ) { case 0: case 2: case 4: diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c index 0df1e1c..fbf6506 100644 --- a/xen/arch/x86/x86_64/traps.c +++ b/xen/arch/x86/x86_64/traps.c @@ -122,10 +122,10 @@ void show_registers(struct cpu_user_regs *regs) fault_crs[0] = read_cr0(); fault_crs[3] = read_cr3(); fault_crs[4] = read_cr4(); - fault_regs.ds = read_segment_register(v, regs, ds); - fault_regs.es = read_segment_register(v, regs, es); - fault_regs.fs = read_segment_register(v, regs, fs); - fault_regs.gs = read_segment_register(v, regs, gs); + fault_regs.ds = read_segment_register(v, regs, x86_seg_ds); + fault_regs.es = read_segment_register(v, regs, x86_seg_es); + fault_regs.fs = read_segment_register(v, regs, x86_seg_fs); + fault_regs.gs = read_segment_register(v, regs, x86_seg_gs); } print_xen_info(); @@ -240,10 +240,10 @@ void do_double_fault(struct cpu_user_regs *regs) crs[2] = read_cr2(); crs[3] = read_cr3(); crs[4] = read_cr4(); - regs->ds = read_segment_register(current, regs, ds); - regs->es = read_segment_register(current, regs, es); - regs->fs = read_segment_register(current, regs, fs); - regs->gs = read_segment_register(current, regs, gs); + regs->ds = read_segment_register(current, regs, x86_seg_ds); + regs->es = read_segment_register(current, regs, x86_seg_es); + regs->fs = read_segment_register(current, regs, x86_seg_fs); + regs->gs = read_segment_register(current, regs, x86_seg_gs); printk("CPU: %d\n", cpu); _show_registers(regs, crs, CTXT_hypervisor, NULL); diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index 7e21ee1..c63988d 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -194,6 +194,7 @@ struct hvm_function_table { bool_t access_w, bool_t access_x); int (*pvh_set_vcpu_info)(struct vcpu *v, struct vcpu_guest_context *ctxtp); + u16 (*vmx_read_selector)(struct vcpu *v, enum x86_segment seg); }; extern struct hvm_function_table hvm_funcs; @@ -308,6 +309,11 @@ static inline void hvm_flush_guest_tlbs(void) void hvm_hypercall_page_initialise(struct domain *d, void *hypercall_page); +static inline u16 pvh_get_selector(struct vcpu *v, enum x86_segment seg) +{ + return hvm_funcs.vmx_read_selector(v, seg); +} + static inline void hvm_get_segment_register(struct vcpu *v, enum x86_segment seg, struct segment_register *reg) diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h index 035944f..d5d90ca 100644 --- a/xen/include/asm-x86/system.h +++ b/xen/include/asm-x86/system.h @@ -4,22 +4,6 @@ #include <xen/lib.h> #include <xen/bitops.h> -/* - * We need vcpu because during context switch, going from pure PV to PVH, - * in save_segments(), current has been updated to next, and no longer pointing - * to the pure PV. Note: for PVH, we update regs->selectors on each vmexit. 
- */ -#define read_segment_register(vcpu, regs, name) \ -({ u16 __sel; \ - struct cpu_user_regs *_regs = (regs); \ - \ - if ( is_pvh_vcpu(vcpu) && guest_mode(regs) ) \ - __sel = _regs->name; \ - else \ - asm volatile ( "movw %%" #name ",%0" : "=r" (__sel) ); \ - __sel; \ -}) - #define wbinvd() \ asm volatile ( "wbinvd" : : : "memory" ) diff --git a/xen/include/asm-x86/traps.h b/xen/include/asm-x86/traps.h index 202e3be..c4a2e2e 100644 --- a/xen/include/asm-x86/traps.h +++ b/xen/include/asm-x86/traps.h @@ -50,4 +50,6 @@ extern int send_guest_trap(struct domain *d, uint16_t vcpuid, unsigned int trap_nr); int emulate_privileged_op(struct cpu_user_regs *regs); +u16 read_segment_register(struct vcpu *vcpu, + struct cpu_user_regs *regs, enum x86_segment seg); #endif /* ASM_TRAP_H */
On Wed, 03 Jul 2013 11:25:49 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 03.07.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 28 Jun 2013 10:44:08 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> >> > <JBeulich@suse.com> wrote:
> > ......
> >> And btw., looking at that patch again I'm also getting the
> >> impression that the GS base handling in that function is lacking
> >> consideration of VGCF_in_kernel.
> >
> > I still fail to see what VGCF_in_kernel has to do with GS base for
> > a PVH guest. The flag should be irrelevant for PVH IMO. Can you kindly
> > elaborate a bit?
>
> VGCF_in_kernel specifies whether a guest wants to start its vCPU
> in user or kernel mode (why the interface permits that is another
> question, but you have to play by what is there).

My understanding is that because of the trap bounce, we need to keep track of kernel/user mode for a 64bit PV guest in ring3. Fortunately, nothing of that for PVH.

thanks
mukesh
Jan Beulich
2013-Jul-04 08:04 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
>>> On 04.07.13 at 04:00, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 03 Jul 2013 11:21:20 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
>> Even if it really is (which I doubt), you still would make PVH
>> different from both PV and HVM, which both don't populate the
>> selector fields of the frame (PV obviously has ->cs and ->ss
>> populated [by the CPU], but HVM avoids even that).
>
> And what's wrong with PVH being a little different?

There's nothing wrong with this as long as it's for a useful purpose, and without introducing hidden dependencies (the latter is what is happening here). Being different just for the purpose of being different is not desirable (and likely not even acceptable, as in any case this makes code more difficult to understand).

>> We'll have to see - at the first glance I don't follow...
>
> Here's what I am talking about:
>
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1241,10 +1241,10 @@ static void save_segments(struct vcpu *v)
>      struct cpu_user_regs *regs = &v->arch.user_regs;
>      unsigned int dirty_segment_mask = 0;
>
> -    regs->ds = read_segment_register(v, regs, ds);
> -    regs->es = read_segment_register(v, regs, es);
> -    regs->fs = read_segment_register(v, regs, fs);
> -    regs->gs = read_segment_register(v, regs, gs);
> +    regs->ds = read_segment_register(v, regs, x86_seg_ds);
> +    regs->es = read_segment_register(v, regs, x86_seg_es);
> +    regs->fs = read_segment_register(v, regs, x86_seg_fs);
> +    regs->gs = read_segment_register(v, regs, x86_seg_gs);

This, I think, is a completely pointless change if you keep the thing being a macro (using token concatenation):

#define read_segment_register(vcpu, regs, name)                   \
({  u16 sel_;                                                     \
    const struct vcpu *vcpu_ = (vcpu);                            \
    const struct cpu_user_regs *regs_ = (regs);                   \
                                                                  \
    if ( is_pvh_vcpu(vcpu_) && guest_mode(regs_) )                \
        sel_ = pvh_get_selector(vcpu_, x86_seg_##name);           \
    else                                                          \
        asm volatile ( "movw %%" #name ",%0" : "=r" (sel_) );     \
    sel_;                                                         \
})

Jan
>>> On 04.07.13 at 04:02, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Wed, 03 Jul 2013 11:25:49 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 03.07.13 at 03:40, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Fri, 28 Jun 2013 10:44:08 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >
>> >> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> >>> wrote:
>> >> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
>> >> > <JBeulich@suse.com> wrote:
>> > ......
>> >> And btw., looking at that patch again I'm also getting the
>> >> impression that the GS base handling in that function is lacking
>> >> consideration of VGCF_in_kernel.
>> >
>> > I still fail to see what VGCF_in_kernel has to do with GS base for
>> > a PVH guest. The flag should be irrelevant for PVH IMO. Can you kindly
>> > elaborate a bit?
>>
>> VGCF_in_kernel specifies whether a guest wants to start its vCPU
>> in user or kernel mode (why the interface permits that is another
>> question, but you have to play by what is there).
>
> My understanding is that because of the trap bounce, we need to keep
> track of kernel/user mode for a 64bit PV guest in ring3. Fortunately,
> nothing of that for PVH.

That's correct, but is largely unrelated to the point I was making. If anything you might sanity check RPL and/or DPL of CS and SS against that flag, and return an error if they're inconsistent.

Jan
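[Jan's suggested consistency check, sketched for vmx_pvh_set_vcpu_info(); treating PVH kernel mode as ring 0 here is an assumption of this sketch, not something the series spells out:]

    /* Sketch: refuse a vCPU context whose SS RPL contradicts the
     * requested VGCF_in_kernel start mode. */
    bool_t in_kernel = !!(ctxtp->flags & VGCF_in_kernel);

    if ( in_kernel != ((ctxtp->user_regs.ss & 3) == 0) )
        return -EINVAL;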
On Fri, 28 Jun 2013 10:31:53 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 28.06.13 at 03:35, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 25.06.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > --- /dev/null
........
> >> Which raises the question of whether your uses of
> >> guest_kernel_mode() are appropriate in the first place: Before this
> >> series there's no use at all under xen/arch/x86/hvm/.
> >>
> >> And if it is, I'd like to point out once again that this check
> >> should be looking at SS.DPL, not CS.RPL.
> >
> > Are you suggesting changing the macro to check SS.DPL instead of
> > the CS.RPL check it has always done for PV as well? Note, PVH has
> > checks in this patch to enforce long mode execution always, so CS.RPL
> > should always be valid for PVH.
>
> I'm saying that guest_kernel_mode() should be looking at the
> VMCS for PVH (and, should it happen to be used in HVM code
> paths, for HVM too) rather than struct cpu_user_regs. That
> makes the saving of the CS selector pointless (in line with how
> HVM behaves), and once you're going through
> hvm_get_segment_register(), you can as well do this properly
> (i.e. look at SS.DPL rather than CS.RPL). And no, repeatedly

Ok, lmk if you are ok with the following:

diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index d502bdf..eb5706e 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -41,6 +41,7 @@ obj-y += numa.o
 obj-y += pci.o
 obj-y += percpu.o
 obj-y += physdev.o
+obj-y += pvh.o
 obj-y += setup.o
 obj-y += shutdown.o
 obj-y += smp.o
diff --git a/xen/arch/x86/pvh.c b/xen/arch/x86/pvh.c
new file mode 100644
index 0000000..db9d434
--- /dev/null
+++ b/xen/arch/x86/pvh.c
@@ -0,0 +1,11 @@
+#include <xen/sched.h>
+#include <asm/hvm/hvm.h>
+
+bool_t pvh_kernel_mode(const struct vcpu *v)
+{
+    struct segment_register seg;
+
+    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);
+    return (seg.attr.fields.dpl != 3);
+}
+
diff --git a/xen/include/asm-x86/x86_64/regs.h b/xen/include/asm-x86/x86_64/regs.h
index 2ea49c5..c437a41 100644
--- a/xen/include/asm-x86/x86_64/regs.h
+++ b/xen/include/asm-x86/x86_64/regs.h
@@ -10,8 +10,10 @@
 #define ring_2(r)    (((r)->cs & 3) == 2)
 #define ring_3(r)    (((r)->cs & 3) == 3)
 
+bool_t pvh_kernel_mode(const struct vcpu *);
+
 #define guest_kernel_mode(v, r)                                   \
-    (is_pvh_vcpu(v) ? (ring_0(r)) :                               \
+    (is_pvh_vcpu(v) ? (pvh_kernel_mode(v)) :                      \
      (!is_pv_32bit_vcpu(v) ?                                      \
       (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :         \
       (ring_1(r))))
Mukesh Rathor
2013-Jul-06 01:43 UTC
Re: [PATCH 09/18] PVH xen: Support privileged op emulation for PVH
On Thu, 04 Jul 2013 09:04:48 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 04.07.13 at 04:00, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Wed, 03 Jul 2013 11:21:20 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> Even if it really is (which I doubt), you still would make PVH
> >> different from both PV and HVM, which both don't populate the
> >> selector fields of the frame (PV obviously has ->cs and ->ss
> >> populated [by the CPU], but HVM avoids even that).
> >
> > And what's wrong with PVH being a little different?
>
> There's nothing wrong with this as long as it's for a useful purpose,
> and without introducing hidden dependencies (the latter is what is
> happening here). Being different just for the purpose of being
> different is not desirable (and likely not even acceptable, as in any
> case this makes code more difficult to understand).
>
> >> We'll have to see - at the first glance I don't follow...
> >
> > Here's what I am talking about:
> >
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1241,10 +1241,10 @@ static void save_segments(struct vcpu *v)
> >      struct cpu_user_regs *regs = &v->arch.user_regs;
> >      unsigned int dirty_segment_mask = 0;
> >
> > -    regs->ds = read_segment_register(v, regs, ds);
> > -    regs->es = read_segment_register(v, regs, es);
> > -    regs->fs = read_segment_register(v, regs, fs);
> > -    regs->gs = read_segment_register(v, regs, gs);
> > +    regs->ds = read_segment_register(v, regs, x86_seg_ds);
> > +    regs->es = read_segment_register(v, regs, x86_seg_es);
> > +    regs->fs = read_segment_register(v, regs, x86_seg_fs);
> > +    regs->gs = read_segment_register(v, regs, x86_seg_gs);
>
> This, I think, is a completely pointless change if you keep the thing
> being a macro (using token concatenation):

I find the ## to be disgusting because it makes code much harder to read/understand, as one cannot use cscope/grep to find things. But, I'll make the change as suggested.

thanks
Mukesh
>>> On 06.07.13 at 03:31, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Ok, lmk if you are ok with the following:

Fundamentally yes. But ...

> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -41,6 +41,7 @@ obj-y += numa.o
>  obj-y += pci.o
>  obj-y += percpu.o
>  obj-y += physdev.o
> +obj-y += pvh.o

Does this indeed warrant a separate file?

> --- /dev/null
> +++ b/xen/arch/x86/pvh.c
> @@ -0,0 +1,11 @@
> +#include <xen/sched.h>
> +#include <asm/hvm/hvm.h>
> +
> +bool_t pvh_kernel_mode(const struct vcpu *v)
> +{
> +    struct segment_register seg;
> +
> +    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);

Ugly cast, calling for hvm_get_segment_register()'s declaration to be changed to include "const" instead.

> +    return (seg.attr.fields.dpl != 3);

It's not really clear what we want to call "kernel mode"; I'd think though that only ring 0 should be considered such (albeit I can see reasons to treat all but ring 3 this way, yet it's really an attribute of the guest OS what rings 1 and 2 are used for).

Jan
On Mon, 08 Jul 2013 09:31:17 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 06.07.13 at 03:31, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > Ok, lmk if you are ok with the following:
>
> Fundamentally yes. But ...
>
> > --- a/xen/arch/x86/Makefile
> > +++ b/xen/arch/x86/Makefile
> > @@ -41,6 +41,7 @@ obj-y += numa.o
> >  obj-y += pci.o
> >  obj-y += percpu.o
> >  obj-y += physdev.o
> > +obj-y += pvh.o
>
> Does this indeed warrant a separate file?

Yeah, I wasn't sure about that, and wasn't sure where else to put it. I
think we could just have hvm_kernel_mode() next to
hvm_get_segment_register() in hvm.h. It can then also be used in HVM
code in various places where it currently checks for dpl/cpl.

thanks
Mukesh
On Mon, 8 Jul 2013 16:09:55 -0700 Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 08 Jul 2013 09:31:17 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
> > >>> On 06.07.13 at 03:31, Mukesh Rathor <mukesh.rathor@oracle.com>
> > >>> wrote:
> > > Ok, lmk if you are ok with the following:
> >
> > Fundamentally yes. But ...
> >
> > > --- a/xen/arch/x86/Makefile
> > > +++ b/xen/arch/x86/Makefile
> > > @@ -41,6 +41,7 @@ obj-y += numa.o
> > >  obj-y += pci.o
> > >  obj-y += percpu.o
> > >  obj-y += physdev.o
> > > +obj-y += pvh.o
> >
> > Does this indeed warrant a separate file?
>
> Yeah, I wasn't sure about that, and wasn't sure where else to put it. I
> think we could just have hvm_kernel_mode() next to
> hvm_get_segment_register() in hvm.h. It can then also be used in HVM
> code in various places where it currently checks for dpl/cpl.

Actually, it is not feasible to put anything in any header, since regs.h
is a pretty early-on header include and can't include any other headers.
So:

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 8284b3b..06f9470 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
     return hvm_funcs.nhvm_intr_blocked(v);
 }

+bool_t hvm_kernel_mode(const struct vcpu *v)
+{
+    struct segment_register seg;
+
+    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);
+    return (seg.attr.fields.dpl == 0);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/x86_64/regs.h b/xen/include/asm-x86/x86_64/regs.h
index 2ea49c5..7a9bc44 100644
--- a/xen/include/asm-x86/x86_64/regs.h
+++ b/xen/include/asm-x86/x86_64/regs.h
@@ -10,8 +10,10 @@
 #define ring_2(r)    (((r)->cs & 3) == 2)
 #define ring_3(r)    (((r)->cs & 3) == 3)

+bool_t hvm_kernel_mode(const struct vcpu *);
+
 #define guest_kernel_mode(v, r)                                   \
-    (is_pvh_vcpu(v) ? (ring_0(r)) :                               \
+    (is_pvh_vcpu(v) ? (hvm_kernel_mode(v)) :                      \
      (!is_pv_32bit_vcpu(v) ?                                      \
       (ring_3(r) && ((v)->arch.flags & TF_kernel_mode)) :         \
       (ring_1(r))))

Also, the cast in

    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);

is difficult to change since it triggers a chain of events downstream
in *_get_segment_register(). So for now, I'll just leave it.

thanks
Mukesh
>>> On 09.07.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Mon, 8 Jul 2013 16:09:55 -0700
> Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>
>> On Mon, 08 Jul 2013 09:31:17 +0100
>> "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>> > >>> On 06.07.13 at 03:31, Mukesh Rathor <mukesh.rathor@oracle.com>
>> > >>> wrote:
>> > > Ok, lmk if you are ok with the following:
>> >
>> > Fundamentally yes. But ...
>> >
>> > > --- a/xen/arch/x86/Makefile
>> > > +++ b/xen/arch/x86/Makefile
>> > > @@ -41,6 +41,7 @@ obj-y += numa.o
>> > >  obj-y += pci.o
>> > >  obj-y += percpu.o
>> > >  obj-y += physdev.o
>> > > +obj-y += pvh.o
>> >
>> > Does this indeed warrant a separate file?
>>
>> Yeah, I wasn't sure about that, and wasn't sure where else to put it. I
>> think we could just have hvm_kernel_mode() next to
>> hvm_get_segment_register() in hvm.h. It can then also be used in HVM
>> code in various places where it currently checks for dpl/cpl.
>
> Actually, it is not feasible to put anything in any header, since regs.h
> is a pretty early-on header include and can't include any other headers.
> So:

Fine with me, except (as said before) ...

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
>      return hvm_funcs.nhvm_intr_blocked(v);
> }
>
> +bool_t hvm_kernel_mode(const struct vcpu *v)
> +{
> +    struct segment_register seg;
> +
> +    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);

... for this cast.

Jan

> +    return (seg.attr.fields.dpl == 0);
> +}
> +
On Tue, 09 Jul 2013 08:31:24 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 09.07.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Mon, 8 Jul 2013 16:09:55 -0700
> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> >
> >> On Mon, 08 Jul 2013 09:31:17 +0100
> >> "Jan Beulich" <JBeulich@suse.com> wrote:
.......
> Fine with me, except (as said before) ...
>
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
> >      return hvm_funcs.nhvm_intr_blocked(v);
> > }
> >
> > +bool_t hvm_kernel_mode(const struct vcpu *v)
> > +{
> > +    struct segment_register seg;
> > +
> > +    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);
>
> ... for this cast.

Like I said in the previous email, changing the cast is very hard, as it
trickles down all the way to vcpu_runnable() through SVM and VMX and would
need changing vcpu_runnable() itself and all callers of it. So, I can
either leave the cast, or better, just remove "const" from the sole caller
using it. Please LMK:

diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index d2f7209..dae8261 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -141,7 +141,7 @@ void show_registers(struct cpu_user_regs *regs)
     }
 }

-void vcpu_show_registers(const struct vcpu *v)
+void vcpu_show_registers(struct vcpu *v)
 {
     const struct cpu_user_regs *regs = &v->arch.user_regs;
     unsigned long crs[8];
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index c3f9f8e..22a72df 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -447,7 +447,7 @@ struct arch_vcpu
 #define hvm_svm         hvm_vcpu.u.svm

 void vcpu_show_execution_state(struct vcpu *);
-void vcpu_show_registers(const struct vcpu *);
+void vcpu_show_registers(struct vcpu *);
>>> On 10.07.13 at 02:33, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 09 Jul 2013 08:31:24 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 09.07.13 at 02:01, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Mon, 8 Jul 2013 16:09:55 -0700
>> > Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> >
>> >> On Mon, 08 Jul 2013 09:31:17 +0100
>> >> "Jan Beulich" <JBeulich@suse.com> wrote:
> .......
>> Fine with me, except (as said before) ...
>>
>> > --- a/xen/arch/x86/hvm/hvm.c
>> > +++ b/xen/arch/x86/hvm/hvm.c
>> > @@ -4642,6 +4642,14 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
>> >      return hvm_funcs.nhvm_intr_blocked(v);
>> > }
>> >
>> > +bool_t hvm_kernel_mode(const struct vcpu *v)
>> > +{
>> > +    struct segment_register seg;
>> > +
>> > +    hvm_get_segment_register((struct vcpu *)v, x86_seg_ss, &seg);
>>
>> ... for this cast.
>
> Like I said in the previous email, changing the cast is very hard, as it
> trickles down all the way to vcpu_runnable() through SVM and VMX and would
> need changing vcpu_runnable() itself and all callers of it. So, I can
> either leave the cast, or better, just remove "const" from the sole caller
> using it. Please LMK:

At first I wanted to say this is a no-go, but in fact you can't have
hvm_get_segment_register() take a const vcpu pointer: On VMX, if
v != current, this may involve a vcpu_pause(), and that one _can't_
have a const pointer passed. Thus casting away the constness above is
actively wrong without an assertion proving v == current (the validity
of which would depend on whether this is used in the context switch
path). Hence, sadly, I have to agree to you removing the const from
vcpu_show_registers()'s only parameter, no matter how wrong that looks.

Jan
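The constraint Jan describes can be made concrete with a schematic of the
VMX side; this is simplified (locking and foreign-VMCS bookkeeping elided)
and not the literal vmx_vmcs_enter():

void vmx_vmcs_enter(struct vcpu *v)
{
    if ( v == current )
        return;                 /* our own VMCS is already loaded */

    /* Loading another vCPU's VMCS requires that vCPU to be paused
     * first - a mutating operation, so v cannot be const. */
    vcpu_pause(v);
    /* ... make v's VMCS current on this pCPU ... */
}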
Mukesh Rathor
2013-Jul-12 00:29 UTC
Re: [PATCH 10/18] PVH xen: interrupt/event-channel delivery to PVH
On Tue, 25 Jun 2013 10:29:54 -0400 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Mon, Jun 24, 2013 at 05:01:39PM -0700, Mukesh Rathor wrote:
> > PVH uses HVMIRQ_callback_vector for interrupt delivery. Also, change
> > hvm_vcpu_has_pending_irq() as PVH doesn't use vlapic emulation.
>
> Please explain why it can't use the normal "if .." in
> hvm_vcpu_has_pending_irq().
>
> I figured it is b/c the guest boots in an HVM container, so the
> event mechanism is offline until it gets enabled. And that means
> no HVM type interrupts (so emulated timer interrupts, say) should
> interrupt it until the event mechanism (or rather the callback
> vector) is in place.
>

What you say is true; I can put that in the git commit message. The
function handles HVM, which can use either the callback vector or the
vlapic. For PVH we know the vlapic is never used, so we can just
shortcut that.

thanks
mukesh
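A sketch of the short-cut being described, under the assumption that only
the event-channel callback vector can be pending for a PVH vCPU.
pvh_pending_callback_vector() is a hypothetical condensation of that check,
and hvm_intack_lapic() is reused here purely for illustration; this is not
the literal patch hunk:

struct hvm_intack hvm_vcpu_has_pending_irq(struct vcpu *v)
{
    int vector;

    if ( unlikely(v->nmi_pending) )
        return hvm_intack_nmi;

    if ( unlikely(v->mce_pending) )
        return hvm_intack_mce;

    if ( is_pvh_vcpu(v) )
    {
        /* No emulated PIC/LAPIC to scan for PVH. */
        vector = pvh_pending_callback_vector(v);
        return (vector > 0) ? hvm_intack_lapic(vector) : hvm_intack_none;
    }

    /* ... existing HVM path (vpic scan, vlapic_has_pending_irq()) ... */
    return hvm_intack_none;    /* placeholder for the elided HVM path */
}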
On Fri, 28 Jun 2013 10:44:08 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> > <JBeulich@suse.com> wrote:
> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
> >> > vcpu_guest_context *ctxtp)
> >> > +{
> >> > +    if ( v->vcpu_id == 0 )
> >> > +        return 0;
> >> > +
> >> > +    vmx_vmcs_enter(v);
> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> >> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> >> > +
> >> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> >> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> >> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> >> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> >> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> >>
> >> How does this work without also writing the "hidden" register
> >> fields?
> >
> > This is for bringing up SMP CPUs by the guest, which already has
> > set the GDT up, so it just needs selectors to be loaded to start the
> > target vcpu.
>
> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
> get the hidden register fields loaded from the VMCS, without
> accessing the GDT. If that understanding of mine is wrong, please
> explain how you see things working in more detail.

Re-reading this, I realize I misunderstood your question. Sorry. The hidden
fields are set to the default values in vmcs create. That call comes from
the tool stack during domain creation via the do_domctl ---> vcpu_initialise
hcall. That happens during guest creation. Here, the guest is booting and
bringing up a secondary CPU, and wants to set certain fields via
the do_vcpu_op() -> VCPUOP_initialise hcall.

Hope that clarifies.

Thanks
Mukesh
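From the guest side, the flow just described would look roughly like this.
secondary_startup(), stack_top_for(), gdt_base/gdt_limit and the KERNEL_*
selectors are placeholders; only gdt.pvh.addr/limit and the hypercall names
come from the ABI quoted above:

#include <linux/string.h>
#include <xen/interface/vcpu.h>
#include <asm/xen/hypercall.h>

static int pvh_cpu_up(unsigned int cpu, struct vcpu_guest_context *ctxt)
{
    int rc;

    memset(ctxt, 0, sizeof(*ctxt));
    ctxt->user_regs.rip = (unsigned long)secondary_startup;
    ctxt->user_regs.rsp = stack_top_for(cpu);
    ctxt->gdt.pvh.addr  = gdt_base;     /* GDT the guest already set up */
    ctxt->gdt.pvh.limit = gdt_limit;
    ctxt->user_regs.cs  = KERNEL_CS;    /* selectors into that GDT */
    ctxt->user_regs.ss  = KERNEL_DS;

    rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt);
    if ( rc == 0 )
        rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
    return rc;
}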
>>> On 16.07.13 at 04:00, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Fri, 28 Jun 2013 10:44:08 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
>> > <JBeulich@suse.com> wrote:
>> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
>> >> > vcpu_guest_context *ctxtp)
>> >> > +{
>> >> > +    if ( v->vcpu_id == 0 )
>> >> > +        return 0;
>> >> > +
>> >> > +    vmx_vmcs_enter(v);
>> >> > +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
>> >> > +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
>> >> > +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
>> >> > +
>> >> > +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
>> >> > +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
>> >> > +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
>> >> > +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
>> >> > +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
>> >>
>> >> How does this work without also writing the "hidden" register
>> >> fields?
>> >
>> > This is for bringing up SMP CPUs by the guest, which already has
>> > set the GDT up, so it just needs selectors to be loaded to start the
>> > target vcpu.
>>
>> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
>> get the hidden register fields loaded from the VMCS, without
>> accessing the GDT. If that understanding of mine is wrong, please
>> explain how you see things working in more detail.
>
> Re-reading this, I realize I misunderstood your question. Sorry. The hidden
> fields are set to the default values in vmcs create. That call comes from
> the tool stack during domain creation via the do_domctl ---> vcpu_initialise
> hcall. That happens during guest creation. Here, the guest is booting and
> bringing up a secondary CPU, and wants to set certain fields via
> the do_vcpu_op() -> VCPUOP_initialise hcall.
>
> Hope that clarifies.

Yes, that clarifies that my complaint was right. Either you add a
comment to the code stating why you don't load these fields with
the guest specified values (from the GDT/LDT), or you set them
to the specified values rather than the defaults (or, of course, a
mixture of both - it may be reasonable to require the descriptors
referenced by CS, SS, DS, and ES to use defaults, but extending
this to FS, GS, and LDT would likely be too much of a requirement).
And even if implying defaults, verifying the descriptors to match
those requirements at least in debug builds may be a good idea.

Jan
On Tue, 16 Jul 2013 07:50:49 +0100 "Jan Beulich" <JBeulich@suse.com> wrote:
> >>> On 16.07.13 at 04:00, Mukesh Rathor <mukesh.rathor@oracle.com>
> >>> wrote:
> > On Fri, 28 Jun 2013 10:44:08 +0100
> > "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> >> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
> >> >>> wrote:
> >> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
> >> > <JBeulich@suse.com> wrote:
> >> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
....
> >> >> How does this work without also writing the "hidden" register
> >> >> fields?
> >> >
> >> > This is for bringing up SMP CPUs by the guest, which already has
> >> > set the GDT up, so it just needs selectors to be loaded to start
> >> > the target vcpu.
> >>
> >> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
> >> get the hidden register fields loaded from the VMCS, without
> >> accessing the GDT. If that understanding of mine is wrong, please
> >> explain how you see things working in more detail.
> >
> > Re-reading this, I realize I misunderstood your question. Sorry. The
> > hidden fields are set to the default values in vmcs create. That
> > call comes from the tool stack during domain creation via the
> > do_domctl ---> vcpu_initialise hcall. That happens during guest
> > creation. Here, the guest is booting and bringing up a secondary CPU,
> > and wants to set certain fields via the do_vcpu_op() ->
> > VCPUOP_initialise hcall.
> >
> > Hope that clarifies.
>
> Yes, that clarifies that my complaint was right. Either you add a
> comment to the code stating why you don't load these fields with
> the guest specified values (from the GDT/LDT), or you set them
> to the specified values rather than the defaults (or, of course, a
> mixture of both - it may be reasonable to require the descriptors
> referenced by CS, SS, DS, and ES to use defaults, but extending
> this to FS, GS, and LDT would likely be too much of a requirement).
> And even if implying defaults, verifying the descriptors to match
> those requirements at least in debug builds may be a good idea.

I see your point. I had it in the back of my mind that loading a selector
will cause the hidden fields to be loaded (SDM Vol 3A, 3.4.3), but I see
that that doesn't apply to the VMCS. I didn't realize this because the
default values are the same as the initial guest GDT. So, options:

o Add a comment indicating that we don't need to load the hidden parts
  because they are the same as the default values.

  - This makes it Linux dependent.

o Load the hidden parts also for any non-null CS/SS/DS/ES selector being
  loaded.

  - This is better if the guest GDT doesn't match the defaults.

o Don't worry about the selectors in xen; just let the guest load them
  itself first thing. I tested this and it works fine. The ABI is
  flexible at this point :).

Please lmk what you prefer. I prefer the last option, as I believe the
hypervisor should do the least, but am OK with any option you prefer.

I think this is the last issue from the previous version, and I'd be
ready to submit the next version of the patches as soon as I fix this.

thanks
Mukesh
>>> On 17.07.13 at 02:47, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> On Tue, 16 Jul 2013 07:50:49 +0100
> "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 16.07.13 at 04:00, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >>> wrote:
>> > On Fri, 28 Jun 2013 10:44:08 +0100
>> > "Jan Beulich" <JBeulich@suse.com> wrote:
>> >
>> >> >>> On 28.06.13 at 04:28, Mukesh Rathor <mukesh.rathor@oracle.com>
>> >> >>> wrote:
>> >> > On Tue, 25 Jun 2013 11:49:57 +0100 "Jan Beulich"
>> >> > <JBeulich@suse.com> wrote:
>> >> >> > +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct
> ....
>> >> >> How does this work without also writing the "hidden" register
>> >> >> fields?
>> >> >
>> >> > This is for bringing up SMP CPUs by the guest, which already has
>> >> > set the GDT up, so it just needs selectors to be loaded to start
>> >> > the target vcpu.
>> >>
>> >> That makes no sense to me: Once you VMLAUNCH that vCPU, it'll
>> >> get the hidden register fields loaded from the VMCS, without
>> >> accessing the GDT. If that understanding of mine is wrong, please
>> >> explain how you see things working in more detail.
>> >
>> > Re-reading this, I realize I misunderstood your question. Sorry. The
>> > hidden fields are set to the default values in vmcs create. That
>> > call comes from the tool stack during domain creation via the
>> > do_domctl ---> vcpu_initialise hcall. That happens during guest
>> > creation. Here, the guest is booting and bringing up a secondary CPU,
>> > and wants to set certain fields via the do_vcpu_op() ->
>> > VCPUOP_initialise hcall.
>> >
>> > Hope that clarifies.
>>
>> Yes, that clarifies that my complaint was right. Either you add a
>> comment to the code stating why you don't load these fields with
>> the guest specified values (from the GDT/LDT), or you set them
>> to the specified values rather than the defaults (or, of course, a
>> mixture of both - it may be reasonable to require the descriptors
>> referenced by CS, SS, DS, and ES to use defaults, but extending
>> this to FS, GS, and LDT would likely be too much of a requirement).
>> And even if implying defaults, verifying the descriptors to match
>> those requirements at least in debug builds may be a good idea.
>
> I see your point. I had it in the back of my mind that loading a selector
> will cause the hidden fields to be loaded (SDM Vol 3A, 3.4.3), but I see
> that that doesn't apply to the VMCS. I didn't realize this because the
> default values are the same as the initial guest GDT. So, options:
>
> o Add a comment indicating that we don't need to load the hidden parts
>   because they are the same as the default values.
>
>   - This makes it Linux dependent.

Not if documented to be this way. There are restrictions on what a PV
guest may do, so reasonable restrictions - properly documented - can
also be placed on PVH.

> o Load the hidden parts also for any non-null CS/SS/DS/ES selector being
>   loaded.
>
>   - This is better if the guest GDT doesn't match the defaults.

If going that route, null selectors also need taking care of (you would
have to invalidate the register, CS only for a 64-bit guest, but all of
them for a 32-bit one). And of course, a NULL selector in CS (and, for a
32-bit guest, in SS) would need to be treated as illegal (and would
probably also cause vmlaunch to fail).

> o Don't worry about the selectors in xen; just let the guest load them
>   itself first thing. I tested this and it works fine. The ABI is
>   flexible at this point :).

I don't like this option.
PV startup code doesn't need to care about setting up selector
registers that aren't used for special purposes, so I'd prefer PVH to
not start placing such a requirement onto the guest. Early startup
code - aiui - should be possible to be the same for PV and PVH modes.
Am I mistaken here?

Bottom line: considering that you probably want to save yourself the
hassle of implementing the second of the options above, I'd see the
first one as the better choice over the last one.

Jan
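The debug-build verification suggested earlier in this thread could take
roughly this shape; read_guest_gdt_entry() is a hypothetical helper that
copies descriptor 'sel' out of the guest GDT at gdt.pvh.addr, and the
expected-bits value is illustrative:

#ifndef NDEBUG
static void pvh_check_desc(const struct vcpu_guest_context *ctxtp,
                           uint16_t sel, uint32_t expect)
{
    struct desc_struct d;

    if ( read_guest_gdt_entry(ctxtp->gdt.pvh.addr, sel, &d) )
        /* Compare type/S/DPL/P (bits 8-15) and AVL/L/D/G (bits 20-23). */
        ASSERT((d.b & 0x00f0ff00) == expect);
}
#endif

/* e.g. for a 64-bit ring-0 code segment (type 0xb, S=1, DPL=0, P=1,
 * L=1, G=1): */
pvh_check_desc(ctxtp, ctxtp->user_regs.cs, 0x00a09b00);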