Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH RFC] Extending pvclock down to usermode for vsyscall
This series implements the vread method for the clocksource, which allows it to be used directly from usermode via vsyscall. Most of the work is in the generic pvclock code, with a bit of hypervisor-specific code to set things up. Specifically, it sets up a page which is an array of pvclock_vcpu_time_info structures indexed by vcpu number. The usermode code gets the current (v)cpu via vgetcpu, gets the appropriate time info and computes system time from the tsc. After doing all that it rechecks the version number to make things happened consistently. The kernel also installs a preempt notifier to bump the version number so that usermode can tell it has had its cpu context switched under it. I''ve included the Xen implementation, but it should be simple to implement for KVM (especially since it already has the feature I had to add to Xen to implement this: the ability to place the pvclock_vcpu_time_info structure at an arbitrary address). With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz Q6600. I''m sure this could be tuned a bit more, but it is already much better than a syscall. (There''s also a couple of other little generic pvclock fixes I''ll send separately.) Thanks, J arch/x86/Kconfig | 4 + arch/x86/include/asm/fixmap.h | 21 +++++ arch/x86/include/asm/pvclock.h | 6 + arch/x86/include/asm/vsyscall.h | 3 arch/x86/kernel/Makefile | 2 arch/x86/kernel/pvclock.c | 156 ++++++++++++++++++++++++++++++++++++---- arch/x86/xen/Kconfig | 6 + arch/x86/xen/mmu.c | 3 arch/x86/xen/smp.c | 2 arch/x86/xen/time.c | 52 +++++++++++++ arch/x86/xen/xen-ops.h | 8 ++ include/xen/interface/vcpu.h | 41 ++++++++++ 12 files changed, 291 insertions(+), 13 deletions(-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH 1/5] x86/pvclock: make sure rdtsc doesn''t speculate out of region
rdtsc is not necessarily ordered with surrounding instructions so we need to make sure that it doesn''t get sampled out of order with respect to reading the time_info. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/kernel/pvclock.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 03801f2..963f349 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -117,10 +117,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) do { version = pvclock_get_time_values(&shadow, src); - barrier(); + rdtsc_barrier(); offset = pvclock_get_nsec_offset(&shadow); ret = shadow.system_timestamp + offset; - barrier(); + rdtsc_barrier(); } while (version != src->version); return ret; -- 1.6.2.5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH 2/5] x86/pvclock: no need to use strong read barriers in pvclock_get_time_values
The time values are guaranteed to be only updated by the pcpu that the vcpu is currently running on (preempting the guest), so there''s never any risk of cross-pcpu races. Therefore, we only need a barrier to prevent compiler reordering. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/kernel/pvclock.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 963f349..5ecce7f 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -86,12 +86,12 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, { do { dst->version = src->version; - rmb(); /* fetch version before data */ + barrier(); /* fetch version before data */ dst->tsc_timestamp = src->tsc_timestamp; dst->system_timestamp = src->system_time; dst->tsc_to_nsec_mul = src->tsc_to_system_mul; dst->tsc_shift = src->tsc_shift; - rmb(); /* test version after fetching data */ + barrier(); /* test version after fetching data */ } while ((src->version & 1) || (dst->version != src->version)); return dst->version; -- 1.6.2.5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH 3/5] x86/pvclock: add vsyscall implementation
This patch allows the pvclock mechanism to be used in usermode. To do this, we map an extra page into usermode containing an array of pvclock_vcpu_time_info structures which give the information required to compute a global system clock from the tsc. With this, we can implement pvclock_clocksource_vread(). One complication is that usermode is subject to two levels of scheduling: kernel scheduling of tasks onto vcpus, and hypervisor scheduling of vcpus onto pcpus. In either case the underlying pcpu changed, and with it, the correct set of parameters to compute tsc->system clock. To address this we install a preempt notifier on sched_out to increment that vcpu''s version number. Usermode can then check the version number is unchanged while computing the time and retry if it has (the only difference from the kernel''s version of the algorithm is that the vcpu may have changed, so we may need to switch pvclock_vcpu_time_info structures. To use this feature, hypervisor-specific code is required to call pvclock_init_vsyscall(), and if successful: - cause the pvclock_vcpu_time_info structure at pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for each vcpu. - use pvclock_clocksource_vread as the implementation of clocksource .vread. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Keir Fraser <keir.fraser@eu.citrix.com> Cc: Avi Kivity <avi@redhat.com> Cc: Glauber Costa <gcosta@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> --- arch/x86/Kconfig | 4 + arch/x86/include/asm/fixmap.h | 3 + arch/x86/include/asm/pvclock.h | 6 ++ arch/x86/include/asm/vsyscall.h | 3 + arch/x86/kernel/Makefile | 2 + arch/x86/kernel/pvclock.c | 152 ++++++++++++++++++++++++++++++++++++--- 6 files changed, 160 insertions(+), 10 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 13ffa5d..93346ff 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -518,6 +518,10 @@ config PARAVIRT_CLOCK bool default n +config PARAVIRT_CLOCK_VSYSCALL + bool + depends on PARAVIRT_CLOCK && PREEMPT_NOTIFIERS + endif config PARAVIRT_DEBUG diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h index 7b2d71d..ff3cffa 100644 --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -80,6 +80,9 @@ enum fixed_addresses { + ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1, VSYSCALL_HPET, #endif +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + FIX_PVCLOCK_TIME_INFO, +#endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_X86_LOCAL_APIC diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 53235fd..d2402b3 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -3,6 +3,7 @@ #include <linux/clocksource.h> #include <asm/pvclock-abi.h> +#include <asm/vsyscall.h> /* some helper functions for xen and kvm pv clock sources */ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src); @@ -11,4 +12,9 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct pvclock_vcpu_time_info *vcpu, struct timespec *ts); +int __init pvclock_init_vsyscall(void); +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); + +cycle_t __vsyscall_fn pvclock_clocksource_vread(void); + #endif /* _ASM_X86_PVCLOCK_H */ diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h index d0983d2..df5fb43 100644 --- a/arch/x86/include/asm/vsyscall.h +++ b/arch/x86/include/asm/vsyscall.h @@ -33,6 +33,9 @@ enum vsyscall_num { extern int __vgetcpu_mode; extern volatile unsigned long __jiffies; +struct getcpu_cache; +extern long vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache); + /* kernel space (writeable) */ extern int vgetcpu_mode; extern struct timezone sys_tz; diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 430d5b2..97d2e88 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -24,10 +24,12 @@ CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp) CFLAGS_hpet.o := $(nostackp) CFLAGS_tsc.o := $(nostackp) CFLAGS_paravirt.o := $(nostackp) +CFLAGS_pvclock.o := $(nostackp) GCOV_PROFILE_vsyscall_64.o := n GCOV_PROFILE_hpet.o := n GCOV_PROFILE_tsc.o := n GCOV_PROFILE_paravirt.o := n +GCOV_PROFILE_pvclock.o := n obj-y := process_$(BITS).o signal.o entry_$(BITS).o obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 5ecce7f..14de7f3 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -17,7 +17,9 @@ #include <linux/kernel.h> #include <linux/percpu.h> + #include <asm/pvclock.h> +#include <asm/vsyscall.h> /* * These are perodically updated @@ -71,9 +73,10 @@ static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift) return product; } -static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) +static __always_inline +u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) { - u64 delta = native_read_tsc() - shadow->tsc_timestamp; + u64 delta = __native_read_tsc() - shadow->tsc_timestamp; return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); } @@ -81,8 +84,9 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) * Reads a consistent set of time-base values from hypervisor, * into a shadow data area. */ -static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, - struct pvclock_vcpu_time_info *src) +static __always_inline +unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, + const struct pvclock_vcpu_time_info *src) { do { dst->version = src->version; @@ -109,18 +113,31 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) return pv_tsc_khz; } -cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) +static __always_inline +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, + cycle_t *cycles) { struct pvclock_shadow_time shadow; unsigned version; cycle_t ret, offset; + version = pvclock_get_time_values(&shadow, src); + rdtsc_barrier(); + offset = pvclock_get_nsec_offset(&shadow); + ret = shadow.system_timestamp + offset; + rdtsc_barrier(); + + *cycles = ret; + return version; +} + +cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) +{ + unsigned version; + cycle_t ret; + do { - version = pvclock_get_time_values(&shadow, src); - rdtsc_barrier(); - offset = pvclock_get_nsec_offset(&shadow); - ret = shadow.system_timestamp + offset; - rdtsc_barrier(); + version = __pvclock_read_cycles(src, &ret); } while (version != src->version); return ret; @@ -151,3 +168,118 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } + +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +static struct pvclock_vcpu_time_info *pvclock_vsyscall_time_info; + +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu) +{ + if (pvclock_vsyscall_time_info == NULL) + return NULL; + + return &pvclock_vsyscall_time_info[cpu]; +} + +static void vti_inc_version(struct pvclock_vcpu_time_info *pvti) +{ + /* + * This increments the version in an interrupt-atomic way. + * We''re not concerned about global bus (inter-cpu) atomicity, + * but we just need to make sure the update can''t be + * interrupted by the hypervisor preempting us. + */ +#ifdef CONFIG_X86 + asm("add $2, %0\n" : "+m" (pvti->version)); +#else +#error FIXME +#endif +} + +/* + * Increment version when switching away from a task so that it can + * tell if it has switched vcpus (hypervisor''s update of the version + * will tell it if it switches pcpus). + */ +static void pvclock_vsyscall_sched_out(struct preempt_notifier *pn, + struct task_struct *next) +{ + int cpu = smp_processor_id(); + struct pvclock_vcpu_time_info *pvti; + + pvti = pvclock_get_vsyscall_time_info(cpu); + if (pvti) + vti_inc_version(pvti); +} + +/* Don''t care about scheduling in */ +static void pvclock_vsyscall_sched_in(struct preempt_notifier *notifier, int cpu) +{ +} + +static struct preempt_notifier pvclock_vsyscall_notifier; +static struct preempt_ops pvclock_vsyscall_preempt_ops = { + .sched_in = pvclock_vsyscall_sched_in, + .sched_out = pvclock_vsyscall_sched_out, +}; + +cycle_t __vsyscall_fn pvclock_clocksource_vread(void) +{ + const struct pvclock_vcpu_time_info *pvti_base; + const struct pvclock_vcpu_time_info *pvti; + cycle_t ret; + u32 version; + + pvti_base = (struct pvclock_vcpu_time_info *)fix_to_virt(FIX_PVCLOCK_TIME_INFO); + + /* + * When looping to get a consistent (time-info, tsc) pair, we + * also need to deal with the possibility we can switch vcpus, + * so make sure we always re-fetch time-info for the current vcpu. + */ + do { + unsigned cpu; + + vgetcpu(&cpu, NULL, NULL); + pvti = &pvti_base[cpu]; + + version = __pvclock_read_cycles(pvti, &ret); + } while (unlikely(pvti->version != version)); + + return ret; +} + +/* + * Initialize the generic pvclock vsyscall state. This will allocate + * a/some page(s) for the per-vcpu pvclock information, set up a + * fixmap mapping for the page(s) + */ +int __init pvclock_init_vsyscall(void) +{ + int cpu; + + /* Just one page for now */ + if (nr_cpu_ids * sizeof(struct vcpu_time_info) > PAGE_SIZE) { + printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit time_info into a single page\n"); + return -ENOSPC; + } + + pvclock_vsyscall_time_info + (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL); + if (pvclock_vsyscall_time_info == NULL) + return -ENOMEM; + + for (cpu = 0; cpu < nr_cpu_ids; cpu++) + pvclock_vsyscall_time_info[cpu].version = ~0; + + __set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info), + PAGE_KERNEL_VSYSCALL); + + preempt_notifier_init(&pvclock_vsyscall_notifier, + &pvclock_vsyscall_preempt_ops); + preempt_notifier_register(&pvclock_vsyscall_notifier); + + return 0; +} + +#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */ -- 1.6.2.5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps
Some set of fixmaps are intented to be mapped into usermode. Add an predicate to test for those fixmaps. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/include/asm/fixmap.h | 18 ++++++++++++++++++ 1 files changed, 18 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h index ff3cffa..0edc2bb 100644 --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx, extern void __this_fixmap_does_not_exist(void); +static inline bool user_fixmap(enum fixed_addresses fixmap) +{ + switch (fixmap) { +#ifdef CONFIG_X86_32 + case FIX_HOLE ... FIX_VDSO: +#else + case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET: +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + case FIX_PVCLOCK_TIME_INFO: +#endif +#endif + return true; + + default: + return false; + } +} + /* * ''index to address'' translation. If anyone tries to use the idx * directly without translation, we catch the bug with a NULL-deference -- 1.6.2.5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 00:50 UTC
[Xen-devel] [PATCH 5/5] xen/time: add pvclock_clocksource_vread support
Add support to register pvclock_vcpu_time_info structures in the userspace mapped page and set the xen clocksource .vread method if that works. The common pvclock code does everything else. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/xen/Kconfig | 6 +++++ arch/x86/xen/mmu.c | 3 +- arch/x86/xen/smp.c | 2 + arch/x86/xen/time.c | 52 ++++++++++++++++++++++++++++++++++++++++++ arch/x86/xen/xen-ops.h | 8 ++++++ include/xen/interface/vcpu.h | 41 +++++++++++++++++++++++++++++++++ 6 files changed, 111 insertions(+), 1 deletions(-) diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig index b83e119..a002004 100644 --- a/arch/x86/xen/Kconfig +++ b/arch/x86/xen/Kconfig @@ -13,6 +13,11 @@ config XEN kernel to boot in a paravirtualized environment under the Xen hypervisor. +config XEN_TIME_VSYSCALL + def_bool y + depends on PREEMPT_NOTIFIERS + select PARAVIRT_CLOCK_VSYSCALL + config XEN_MAX_DOMAIN_MEMORY int "Maximum allowed size of a domain in gigabytes" default 8 if X86_32 @@ -36,3 +41,4 @@ config XEN_DEBUG_FS help Enable statistics output and various tuning options in debugfs. Enabling this option may incur a significant performance overhead. + diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index 4ceb285..99b8aa5 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot) # endif #else case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE: + case FIX_PVCLOCK_TIME_INFO: #endif #ifdef CONFIG_X86_LOCAL_APIC case FIX_APIC_BASE: /* maps dummy local APIC */ @@ -1834,7 +1835,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot) #ifdef CONFIG_X86_64 /* Replicate changes to map the vsyscall page into the user pagetable vsyscall mapping. */ - if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) { + if (user_fixmap(idx)) { unsigned long vaddr = __fix_to_virt(idx); set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte); } diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c index 429834e..a2ee882 100644 --- a/arch/x86/xen/smp.c +++ b/arch/x86/xen/smp.c @@ -313,6 +313,8 @@ static int __cpuinit xen_cpu_up(unsigned int cpu) if (rc) return rc; + xen_setup_vcpu_vsyscall_time_info(cpu); + rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL); BUG_ON(rc); diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index 0a5aa44..ab3119e 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -474,6 +474,55 @@ void xen_timer_resume(void) } } +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL +void xen_setup_vcpu_vsyscall_time_info(int cpu) +{ + int ret; + struct pvclock_vcpu_time_info *pvti; + struct vcpu_register_time_memory_area t; + + pvti = pvclock_get_vsyscall_time_info(cpu); + if (!pvti) + return; + + t.addr.pv = pvti; + + ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, + cpu, &t); + /* + * If the call succeeds, it will update the vcpu_time_info and + * set the version to something valid. If it fails, we set + * the version to invalid so that usermode doesn''t try to use + * it. + */ + if (ret != 0) + pvti->version = ~0; +} + +static int __init xen_setup_vsyscall_timeinfo(int cpu) +{ + int ret; + + ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, + cpu, NULL); + if (ret == -ENOSYS) { + printk(KERN_INFO "xen: vcpu_time_info placement not supported\n"); + return ret; + } + + ret = pvclock_init_vsyscall(); + if (ret != 0) { + printk(KERN_INFO "xen: Failed to initialize pvclock vsyscall: %d\n", + ret); + return ret; + } + + xen_setup_vcpu_vsyscall_time_info(cpu); + + return 0; +} +#endif /* CONFIG_PARAVIRT_CLOCK_VSYSCALL */ + __init void xen_time_init(void) { int cpu = smp_processor_id(); @@ -496,4 +545,7 @@ __init void xen_time_init(void) xen_setup_timer(cpu); xen_setup_cpu_clockevents(); + + if (xen_setup_vsyscall_timeinfo(cpu) == 0) + xen_clocksource.vread = pvclock_clocksource_vread; } diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h index 22494fd..d92ddc8 100644 --- a/arch/x86/xen/xen-ops.h +++ b/arch/x86/xen/xen-ops.h @@ -58,6 +58,14 @@ bool xen_vcpu_stolen(int vcpu); void xen_setup_vcpu_info_placement(void); +#ifdef CONFIG_XEN_TIME_VSYSCALL +void xen_setup_vcpu_vsyscall_time_info(int cpu); +#else +static inline void xen_setup_vcpu_vsyscall_time_info(int cpu) +{ +} +#endif + #ifdef CONFIG_SMP void xen_smp_init(void); diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h index 87e6f8a..0a8edfd 100644 --- a/include/xen/interface/vcpu.h +++ b/include/xen/interface/vcpu.h @@ -170,4 +170,45 @@ struct vcpu_register_vcpu_info { }; DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info); + +/* + * Register a memory location to get a secondary copy of the vcpu time + * parameters. The master copy still exists as part of the vcpu + * shared memory area, and this secondary copy is updated whenever the + * master copy is updated. + * + * The intent is that this copy may be mapped (RO) into userspace so + * that usermode can compute system time using the time info and the + * tsc. Usermode will see an array of vcpu_time_info structures, one + * for each vcpu, and choose the right one by an existing mechanism + * which allows it to get the current vcpu number (such as via a + * segment limit). It can then apply the normal algorithm to compute + * system time from the tsc. + * + * However, because usermode threads are subject to two levels of + * scheduling (kernel scheduling of threads to vcpus, and Xen + * scheduling of vcpus to pcpus), we must make sure that the thread + * knows it has had a race with either (or both) of these two events. + * To allow this, the guest kernel updates the time_info version + * number when the vcpu does a context switch, so that usermode will + * always see a version number change when the parameters need to be + * revalidated. Xen makes sure that it always updates the guest''s + * version rather than overwriting it. (It assumes that a vcpu will + * always update its own version number, so there are no cross-cpu + * synchronization issues; the only concern is that if the guest + * kernel gets preempted by Xen it doesn''t revert the version number + * to an older value.) + * + * @extra_arg == pointer to vcpu_register_time_info_memory_area structure. + */ +#define VCPUOP_register_vcpu_time_memory_area 13 + +struct vcpu_register_time_memory_area { + union { + struct vcpu_time_info *v; + struct pvclock_vcpu_time_info *pv; + uint64_t p; + } addr; +}; + #endif /* __XEN_PUBLIC_VCPU_H__ */ -- 1.6.2.5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-06 09:04 UTC
[Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/06/2009 02:50 AM, Jeremy Fitzhardinge wrote:> This patch allows the pvclock mechanism to be used in usermode. To > do this, we map an extra page into usermode containing an array of > pvclock_vcpu_time_info structures which give the information required > to compute a global system clock from the tsc. With this, we can > implement pvclock_clocksource_vread(). > > One complication is that usermode is subject to two levels of scheduling: > kernel scheduling of tasks onto vcpus, and hypervisor scheduling of > vcpus onto pcpus. In either case the underlying pcpu changed, and with > it, the correct set of parameters to compute tsc->system clock. To > address this we install a preempt notifier on sched_out to increment > that vcpu''s version number. Usermode can then check the version number > is unchanged while computing the time and retry if it has (the only > difference from the kernel''s version of the algorithm is that the vcpu > may have changed, so we may need to switch pvclock_vcpu_time_info > structures. > > To use this feature, hypervisor-specific code is required > to call pvclock_init_vsyscall(), and if successful: > - cause the pvclock_vcpu_time_info structure at > pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for > each vcpu. > - use pvclock_clocksource_vread as the implementation of clocksource > .vread. > > + > +cycle_t __vsyscall_fn pvclock_clocksource_vread(void) > +{ > + const struct pvclock_vcpu_time_info *pvti_base; > + const struct pvclock_vcpu_time_info *pvti; > + cycle_t ret; > + u32 version; > + > + pvti_base = (struct pvclock_vcpu_time_info *)fix_to_virt(FIX_PVCLOCK_TIME_INFO); > + > + /* > + * When looping to get a consistent (time-info, tsc) pair, we > + * also need to deal with the possibility we can switch vcpus, > + * so make sure we always re-fetch time-info for the current vcpu. > + */ > + do { > + unsigned cpu; > + > + vgetcpu(&cpu, NULL, NULL); > + pvti =&pvti_base[cpu]; > + > + version = __pvclock_read_cycles(pvti,&ret); > + } while (unlikely(pvti->version != version)); > + > + return ret; > +} >Instead of using vgetcpu() and rdtsc() independently, you can use rdtscp to read both atomically. This removes the need for the preempt notifier.> + > +/* > + * Initialize the generic pvclock vsyscall state. This will allocate > + * a/some page(s) for the per-vcpu pvclock information, set up a > + * fixmap mapping for the page(s) > + */ > +int __init pvclock_init_vsyscall(void) > +{ > + int cpu; > + > + /* Just one page for now */ > + if (nr_cpu_ids * sizeof(struct vcpu_time_info)> PAGE_SIZE) { > + printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit time_info into a single page\n"); > + return -ENOSPC; > + } > + > + pvclock_vsyscall_time_info > + (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL); > + if (pvclock_vsyscall_time_info == NULL) > + return -ENOMEM; > + >Need to align the vcpu_time_infos on a cacheline boundary.> + for (cpu = 0; cpu< nr_cpu_ids; cpu++) > + pvclock_vsyscall_time_info[cpu].version = ~0; > + > + __set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info), > + PAGE_KERNEL_VSYSCALL); > + > + preempt_notifier_init(&pvclock_vsyscall_notifier, > + &pvclock_vsyscall_preempt_ops); > + preempt_notifier_register(&pvclock_vsyscall_notifier); > + >preempt notifiers are per-thread, not global, and will upset the cycle counters. I''d drop them and use rdtscp instead (and give up if the processor doesn''t support it). -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Oct-06 10:23 UTC
Re: [Xen-devel] [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps
>>> Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> 06.10.09 02:50 >>> >--- a/arch/x86/include/asm/fixmap.h >+++ b/arch/x86/include/asm/fixmap.h >@@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx, > > extern void __this_fixmap_does_not_exist(void); > >+static inline bool user_fixmap(enum fixed_addresses fixmap) >+{ >+ switch (fixmap) { >+#ifdef CONFIG_X86_32 >+ case FIX_HOLE ... FIX_VDSO: >+#else >+ case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET: >+#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL >+ case FIX_PVCLOCK_TIME_INFO: >+#endif >+#endif >+ return true; >+ >+ default:Isn''t that nested #ifdef rather meant to be successive to the 32-/64-bit one? Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Oct-06 10:28 UTC
Re: [Xen-devel] [PATCH 5/5] xen/time: add pvclock_clocksource_vread support
>>> Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> 06.10.09 02:50 >>> >--- a/arch/x86/xen/mmu.c >+++ b/arch/x86/xen/mmu.c >@@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot) > # endif > #else > case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE: >+ case FIX_PVCLOCK_TIME_INFO: > #endif > #ifdef CONFIG_X86_LOCAL_APIC > case FIX_APIC_BASE: /* maps dummy local APIC */And I think this added case also needs to move past the #endif. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-06 14:19 UTC
[Xen-devel] RE: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Jeremy Fitzhardinge [mailto:jeremy.fitzhardinge@citrix.com] > With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz > Q6600. I''m sure this could be tuned a bit more, but it is > already much better than a syscall.To evaluate the goodness of this, we really need a full set of measurements for: a) cost of rdtsc (and rdtscp if different) b) cost of vsyscall+pvclock c) cost of rdtsc emulated d) cost of a hypercall that returns "hypervisor system time" On a E6850 (3Ghz but let''s use cycles), I measured; a == 72 cycles c == 1080 cycles d == 780 cycles It may be partly apples and oranges, but it looks like a good guess for b on my machine is b == 240 cycles Not bad, but is there any additional context switch cost to support it?> From: Avi Kivity [mailto:avi@redhat.com] > Instead of using vgetcpu() and rdtsc() independently, you can > use rdtscp > to read both atomically. This removes the need for the > preempt notifier.Xen does not currently expose rdtscp and so does not emulate (or context switch) TSC_AUX. Context switching TSC_AUX is certainly possible, but will likely be expensive. If the primary reason for vsyscall+pvclock is to maximize performance for gettimeofday/clock_gettime, this cost would need to be added to the mix.> preempt notifiers are per-thread, not global, and will upset > the cycle > counters. I''d drop them and use rdtscp instead (and give up if the > processor doesn''t support it).Even if rdtscp is used, in the Intel processor lineup only the very latest (Nehalem) supports rdtscp, so "give up" doesn''t seem like a very good option, at least in the near future. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-06 15:11 UTC
[Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/06/2009 04:19 PM, Dan Magenheimer wrote:>> From: Jeremy Fitzhardinge [mailto:jeremy.fitzhardinge@citrix.com] >> With this in place, I can do a gettimeofday in about 100ns on a 2.4GHz >> Q6600. I''m sure this could be tuned a bit more, but it is >> already much better than a syscall. >> > To evaluate the goodness of this, we really need a full > set of measurements for: > > a) cost of rdtsc (and rdtscp if different) > b) cost of vsyscall+pvclock > c) cost of rdtsc emulated > d) cost of a hypercall that returns "hypervisor system time" > > On a E6850 (3Ghz but let''s use cycles), I measured; > > a == 72 cycles > c == 1080 cycles > d == 780 cycles > > It may be partly apples and oranges, but it looks > like a good guess for b on my machine is > > b == 240 cycles >Two rdtscps should suffice (and I think they are much faster on modern machines).> Not bad, but is there any additional context switch > cost to support it? >rdtscp requires an additional msr read/write on heavyweight host context switches. Should be negligible compared to the savings.>> From: Avi Kivity [mailto:avi@redhat.com] >> Instead of using vgetcpu() and rdtsc() independently, you can >> use rdtscp >> to read both atomically. This removes the need for the >> preempt notifier. >> > Xen does not currently expose rdtscp and so does not emulate > (or context switch) TSC_AUX. Context switching TSC_AUX > is certainly possible, but will likely be expensive. > If the primary reason for vsyscall+pvclock is to maximize > performance for gettimeofday/clock_gettime, this cost > would need to be added to the mix. >It will cost ~100 cycles on heavyweight host context switch (guest-to-guest).>> preempt notifiers are per-thread, not global, and will upset >> the cycle >> counters. I''d drop them and use rdtscp instead (and give up if the >> processor doesn''t support it). >> > Even if rdtscp is used, in the Intel processor lineup > only the very latest (Nehalem) supports rdtscp, so > "give up" doesn''t seem like a very good option, at least > in the near future. >Why not? we still fall back to the guest kernel. By the time guest kernels with rdtscp support are in the field, these machines will be quiet old. -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 18:46 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/06/09 02:04, Avi Kivity wrote:> On 10/06/2009 02:50 AM, Jeremy Fitzhardinge wrote: >> This patch allows the pvclock mechanism to be used in usermode. To >> do this, we map an extra page into usermode containing an array of >> pvclock_vcpu_time_info structures which give the information required >> to compute a global system clock from the tsc. With this, we can >> implement pvclock_clocksource_vread(). >> >> One complication is that usermode is subject to two levels of >> scheduling: >> kernel scheduling of tasks onto vcpus, and hypervisor scheduling of >> vcpus onto pcpus. In either case the underlying pcpu changed, and with >> it, the correct set of parameters to compute tsc->system clock. To >> address this we install a preempt notifier on sched_out to increment >> that vcpu''s version number. Usermode can then check the version number >> is unchanged while computing the time and retry if it has (the only >> difference from the kernel''s version of the algorithm is that the vcpu >> may have changed, so we may need to switch pvclock_vcpu_time_info >> structures. >> >> To use this feature, hypervisor-specific code is required >> to call pvclock_init_vsyscall(), and if successful: >> - cause the pvclock_vcpu_time_info structure at >> pvclock_get_vsyscall_time_info(cpu) to be updated appropriately for >> each vcpu. >> - use pvclock_clocksource_vread as the implementation of clocksource >> .vread. >> >> + >> +cycle_t __vsyscall_fn pvclock_clocksource_vread(void) >> +{ >> + const struct pvclock_vcpu_time_info *pvti_base; >> + const struct pvclock_vcpu_time_info *pvti; >> + cycle_t ret; >> + u32 version; >> + >> + pvti_base = (struct pvclock_vcpu_time_info >> *)fix_to_virt(FIX_PVCLOCK_TIME_INFO); >> + >> + /* >> + * When looping to get a consistent (time-info, tsc) pair, we >> + * also need to deal with the possibility we can switch vcpus, >> + * so make sure we always re-fetch time-info for the current vcpu. >> + */ >> + do { >> + unsigned cpu; >> + >> + vgetcpu(&cpu, NULL, NULL); >> + pvti =&pvti_base[cpu]; >> + >> + version = __pvclock_read_cycles(pvti,&ret); >> + } while (unlikely(pvti->version != version)); >> + >> + return ret; >> +} >> > > Instead of using vgetcpu() and rdtsc() independently, you can use > rdtscp to read both atomically. This removes the need for the preempt > notifier.rdtscp first appeared on Intel with Nehalem, so we need to support older Intel chips. You could use rdscp to get (tsc,cpu) atomically, but that''s not sufficient to be able to get a consistent snapshot of (tsc, time_info) because it doesn''t give you the pvclock_vcpu_time_info version number. If TSC_AUX contained that too, it might be possible. Alternatively you could compare the tsc with pvclock.tsc_timestamp, but unfortunately the ABI doesn''t specify that tsc_timestamp is updated in any particular order compared to the rest of the fields, so you still can''t use that to get a consistent snapshot (we can revise the ABI, of course). So either way it doesn''t avoid the need to iterate. vgetcpu will use rdtscp if available, but I agree it is unfortunate we need to do a redundant rdtsc in that case. Avoiding the preempt notifier would be nice. Definitely worth a followup-patch.>> + >> +/* >> + * Initialize the generic pvclock vsyscall state. This will allocate >> + * a/some page(s) for the per-vcpu pvclock information, set up a >> + * fixmap mapping for the page(s) >> + */ >> +int __init pvclock_init_vsyscall(void) >> +{ >> + int cpu; >> + >> + /* Just one page for now */ >> + if (nr_cpu_ids * sizeof(struct vcpu_time_info)> PAGE_SIZE) { >> + printk(KERN_WARNING "pvclock_vsyscall: too many CPUs to fit >> time_info into a single page\n"); >> + return -ENOSPC; >> + } >> + >> + pvclock_vsyscall_time_info >> + (struct pvclock_vcpu_time_info *)get_zeroed_page(GFP_KERNEL); >> + if (pvclock_vsyscall_time_info == NULL) >> + return -ENOMEM; >> + >> > > Need to align the vcpu_time_infos on a cacheline boundary.OK.>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++) >> + pvclock_vsyscall_time_info[cpu].version = ~0; >> + >> + __set_fixmap(FIX_PVCLOCK_TIME_INFO, >> __pa(pvclock_vsyscall_time_info), >> + PAGE_KERNEL_VSYSCALL); >> + >> + preempt_notifier_init(&pvclock_vsyscall_notifier, >> + &pvclock_vsyscall_preempt_ops); >> + preempt_notifier_register(&pvclock_vsyscall_notifier); >> + >> > > preempt notifiers are per-thread, not global, and will upset the cycle > counters.Ah, so I need to register it on every new thread? That''s a bit awkward. This is intended to satisfy the cycle-counters who want to do gettimeofday a million times a second, where I guess the tradeoff of avoiding a pile of syscalls is worth a bit of context-switch overhead.> I''d drop them and use rdtscp instead (and give up if the processor > doesn''t support it). >None of my test machines have rdtscp, so that won''t do ;) J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 18:47 UTC
Re: [Xen-devel] [PATCH 4/5] x86/fixmap: add a predicate for usermode fixmaps
On 10/06/09 03:23, Jan Beulich wrote:>>>> Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> 06.10.09 02:50 >>> >>>> >> --- a/arch/x86/include/asm/fixmap.h >> +++ b/arch/x86/include/asm/fixmap.h >> @@ -181,6 +181,24 @@ static inline void __set_fixmap(enum fixed_addresses idx, >> >> extern void __this_fixmap_does_not_exist(void); >> >> +static inline bool user_fixmap(enum fixed_addresses fixmap) >> +{ >> + switch (fixmap) { >> +#ifdef CONFIG_X86_32 >> + case FIX_HOLE ... FIX_VDSO: >> +#else >> + case VSYSCALL_LAST_PAGE ... VSYSCALL_HPET: >> +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL >> + case FIX_PVCLOCK_TIME_INFO: >> +#endif >> +#endif >> + return true; >> + >> + default: >> > Isn''t that nested #ifdef rather meant to be successive to the 32-/64-bit > one? >No, because 32-bit doesn''t support vsyscall at present. But I''ve been meaning to add a CONFIG_X86_VSYSCALL to make all the vsyscall stuff conditional on, so if/when 32-bit vsyscall gets done its easy to turn it all on. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-06 18:48 UTC
Re: [Xen-devel] [PATCH 5/5] xen/time: add pvclock_clocksource_vread support
On 10/06/09 03:28, Jan Beulich wrote:>> --- a/arch/x86/xen/mmu.c >> +++ b/arch/x86/xen/mmu.c >> @@ -1814,6 +1814,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot) >> # endif >> #else >> case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE: >> + case FIX_PVCLOCK_TIME_INFO: >> #endif >> #ifdef CONFIG_X86_LOCAL_APIC >> case FIX_APIC_BASE: /* maps dummy local APIC */ >> > And I think this added case also needs to move past the #endif. >No, for the same reason. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-07 10:25 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote:> >> Instead of using vgetcpu() and rdtsc() independently, you can use >> rdtscp to read both atomically. This removes the need for the preempt >> notifier. >> > rdtscp first appeared on Intel with Nehalem, so we need to support older > Intel chips. >We can support them by falling back to the kernel. I''m a bit worried about the kernel playing with the hypervisor''s version field. It''s better to introduce yet a new version for the kernel, and check both.> You could use rdscp to get (tsc,cpu) atomically, but that''s not > sufficient to be able to get a consistent snapshot of (tsc, time_info) > because it doesn''t give you the pvclock_vcpu_time_info version number. > If TSC_AUX contained that too, it might be possible. Alternatively you > could compare the tsc with pvclock.tsc_timestamp, but unfortunately the > ABI doesn''t specify that tsc_timestamp is updated in any particular > order compared to the rest of the fields, so you still can''t use that to > get a consistent snapshot (we can revise the ABI, of course). > > So either way it doesn''t avoid the need to iterate. vgetcpu will use > rdtscp if available, but I agree it is unfortunate we need to do a > redundant rdtsc in that case. > >def try_pvclock_vtime(): tsc, p0 = rdtscp() v0 = pvclock[p0].version tsc, p = rdtscp() t = pvclock_time(pvclock[p], tsc) if p != p0 or pvclock[p].version != v0: raise Exception("Processor or timebased change under our feet") return t def pvclock_time(): while True: try: return try_pvlock_time() except: pass So, two rdtscps and two compares.>>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++) >>> + pvclock_vsyscall_time_info[cpu].version = ~0; >>> + >>> + __set_fixmap(FIX_PVCLOCK_TIME_INFO, >>> __pa(pvclock_vsyscall_time_info), >>> + PAGE_KERNEL_VSYSCALL); >>> + >>> + preempt_notifier_init(&pvclock_vsyscall_notifier, >>> +&pvclock_vsyscall_preempt_ops); >>> + preempt_notifier_register(&pvclock_vsyscall_notifier); >>> + >>> >> preempt notifiers are per-thread, not global, and will upset the cycle >> counters. >> > Ah, so I need to register it on every new thread? That''s a bit awkward. >It''s used to manage processor registers, much like the fpu. If a thread uses a register that''s not saved and restored by the normal context switch code, it can register a preempt notifier to do that instead.> This is intended to satisfy the cycle-counters who want to do > gettimeofday a million times a second, where I guess the tradeoff of > avoiding a pile of syscalls is worth a bit of context-switch overhead. >It''s sufficient to increment a version counter on thread migration, no need to do it on context switch. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-07 19:29 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/09 03:25, Avi Kivity wrote:> On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote: >> >>> Instead of using vgetcpu() and rdtsc() independently, you can use >>> rdtscp to read both atomically. This removes the need for the preempt >>> notifier. >>> >> rdtscp first appeared on Intel with Nehalem, so we need to support older >> Intel chips. >> > > We can support them by falling back to the kernel.Yes, but its easy enough to support them with the fast-path.> I''m a bit worried about the kernel playing with the hypervisor''s > version field.For Xen I explicitly made it not a problem by adding the notion of a secondary pvclock_vcpu_time_info structure which is updated by copying, aside from the version number which is updated as-is. As far as I can tell it isn''t a problem for KVM either. The guest version number is atomic with respect to preemption by the hypervisor so there''s no scope for racing. (The ABI already guarantees that the pvclock structures are never updated cross-cpu.) It ultimately doesn''t matter what the version number is so long as it changes when the parameters are updated, and version numbers can''t be reused within a window where things get confused.> It''s better to introduce yet a new version for the kernel, and check > both.Two version numbers are awkward to read atomically at least on 32-bit. And I don''t think its necessary.> def try_pvclock_vtime(): > tsc, p0 = rdtscp() > v0 = pvclock[p0].version > tsc, p = rdtscp() > t = pvclock_time(pvclock[p], tsc) > if p != p0 or pvclock[p].version != v0: > raise Exception("Processor or timebased change under our feet") > return t > > def pvclock_time(): > while True: > try: > return try_pvlock_time() > except: > pass > > So, two rdtscps and two compares.Yep, that would work.> It''s sufficient to increment a version counter on thread migration, no > need to do it on context switch. >That''s true; switch_out is a pessimistic approximation of that. But is there a convenient hook to test for migration? J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-07 20:09 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/2009 09:29 PM, Jeremy Fitzhardinge wrote:> >> I''m a bit worried about the kernel playing with the hypervisor''s >> version field. >> > For Xen I explicitly made it not a problem by adding the notion of a > secondary pvclock_vcpu_time_info structure which is updated by copying, > aside from the version number which is updated as-is. >When do you copy? I''d rather have a single copy for guest and host.> As far as I can tell it isn''t a problem for KVM either. The guest > version number is atomic with respect to preemption by the hypervisor so > there''s no scope for racing. (The ABI already guarantees that the > pvclock structures are never updated cross-cpu.) > > It ultimately doesn''t matter what the version number is so long as it > changes when the parameters are updated, and version numbers can''t be > reused within a window where things get confused. >If the hypervisor does a pvclock->version = somethingelse->version++ then the guest may get confused. But I understand you have a guest-private ->version?>> It''s better to introduce yet a new version for the kernel, and check >> both. >> > Two version numbers are awkward to read atomically at least on 32-bit. > And I don''t think its necessary. >No need to read them atomically. cpu1 = vgetcpu() hver1 = pvclock[cpu1].hver kver1 = pvclock[cpu1].kver tsc = rdtsc /* multipication magic with pvclock[cpu1]*/ cpu2 = vgetcpu() hver2 = pvclock[cpu2].hver kver2 = pvclock[cpu2].kver valid = cpu1 == cpu2 && hver1 == hver2 && kver1 == kver2>> It''s sufficient to increment a version counter on thread migration, no >> need to do it on context switch. >> >> > That''s true; switch_out is a pessimistic approximation of that. But is > there a convenient hook to test for migration? >I''d guess not but it''s probably easy to add one in the migration thread. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-07 20:48 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> We can support them by falling back to the kernel. I''m a bit worried > about the kernel playing with the hypervisor''s version field. It''s > better to introduce yet a new version for the kernel, and check both.On Nehalem, apps that need timestamp information at a high frequency will likely use rdtsc/rdtscp directly. I very much support Jeremy''s efforts to make vsyscall+pvclock work fast on processors other than the very newest ones. Dan> -----Original Message----- > From: Avi Kivity [mailto:avi@redhat.com] > Sent: Wednesday, October 07, 2009 4:26 AM > To: Jeremy Fitzhardinge > Cc: Jeremy Fitzhardinge; Dan Magenheimer; Xen-devel; Kurt Hackel; the > arch/x86 maintainers; Linux Kernel Mailing List; Glauber de Oliveira > Costa; Keir Fraser; Zach Brown; Chris Mason > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > > On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote: > > > >> Instead of using vgetcpu() and rdtsc() independently, you can use > >> rdtscp to read both atomically. This removes the need for > the preempt > >> notifier. > >> > > rdtscp first appeared on Intel with Nehalem, so we need to > support older > > Intel chips. > > > > We can support them by falling back to the kernel. I''m a bit worried > about the kernel playing with the hypervisor''s version field. It''s > better to introduce yet a new version for the kernel, and check both. > > > You could use rdscp to get (tsc,cpu) atomically, but that''s not > > sufficient to be able to get a consistent snapshot of (tsc, > time_info) > > because it doesn''t give you the pvclock_vcpu_time_info > version number. > > If TSC_AUX contained that too, it might be possible. > Alternatively you > > could compare the tsc with pvclock.tsc_timestamp, but > unfortunately the > > ABI doesn''t specify that tsc_timestamp is updated in any particular > > order compared to the rest of the fields, so you still > can''t use that to > > get a consistent snapshot (we can revise the ABI, of course). > > > > So either way it doesn''t avoid the need to iterate. > vgetcpu will use > > rdtscp if available, but I agree it is unfortunate we need to do a > > redundant rdtsc in that case. > > > > > > def try_pvclock_vtime(): > tsc, p0 = rdtscp() > v0 = pvclock[p0].version > tsc, p = rdtscp() > t = pvclock_time(pvclock[p], tsc) > if p != p0 or pvclock[p].version != v0: > raise Exception("Processor or timebased change under our feet") > return t > > def pvclock_time(): > while True: > try: > return try_pvlock_time() > except: > pass > > So, two rdtscps and two compares. > > >>> + for (cpu = 0; cpu< nr_cpu_ids; cpu++) > >>> + pvclock_vsyscall_time_info[cpu].version = ~0; > >>> + > >>> + __set_fixmap(FIX_PVCLOCK_TIME_INFO, > >>> __pa(pvclock_vsyscall_time_info), > >>> + PAGE_KERNEL_VSYSCALL); > >>> + > >>> + preempt_notifier_init(&pvclock_vsyscall_notifier, > >>> +&pvclock_vsyscall_preempt_ops); > >>> + preempt_notifier_register(&pvclock_vsyscall_notifier); > >>> + > >>> > >> preempt notifiers are per-thread, not global, and will > upset the cycle > >> counters. > >> > > Ah, so I need to register it on every new thread? That''s a > bit awkward. > > > > It''s used to manage processor registers, much like the fpu. > If a thread > uses a register that''s not saved and restored by the normal context > switch code, it can register a preempt notifier to do that instead. > > > This is intended to satisfy the cycle-counters who want to do > > gettimeofday a million times a second, where I guess the tradeoff of > > avoiding a pile of syscalls is worth a bit of > context-switch overhead. > > > > It''s sufficient to increment a version counter on thread > migration, no > need to do it on context switch. > > -- > Do not meddle in the internals of kernels, for they are > subtle and quick to panic. > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-07 21:08 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/2009 10:48 PM, Dan Magenheimer wrote:>> We can support them by falling back to the kernel. I''m a bit worried >> about the kernel playing with the hypervisor''s version field. It''s >> better to introduce yet a new version for the kernel, and check both. >> > On Nehalem, apps that need timestamp information at a high > frequency will likely use rdtsc/rdtscp directly. > >Then they will get incorrect timing once they are live migrated. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-07 21:19 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/09 13:09, Avi Kivity wrote:> On 10/07/2009 09:29 PM, Jeremy Fitzhardinge wrote: >> >>> I''m a bit worried about the kernel playing with the hypervisor''s >>> version field. >>> >> For Xen I explicitly made it not a problem by adding the notion of a >> secondary pvclock_vcpu_time_info structure which is updated by copying, >> aside from the version number which is updated as-is. >> > > When do you copy? > > I''d rather have a single copy for guest and host.When Xen updates the parameters normally. The interface never really needed to share the memory between hypervisor and guest, and I think avoiding it is a bit more robust. But for KVM, you already use the MSR to place the pvclock_vcpu_time_info structure, so you could just place it in the page and use the same memory for kernel and usermode.> If the hypervisor does a pvclock->version = somethingelse->version++ > then the guest may get confused. But I understand you have a > guest-private ->version?The guest should never get confused by the version being changed by the hypervisor. It''s already part of the ABI. Or did you mean something else? I''m not sure what you mean by "guest-private version"; the versions are always guest-private: te version is part of the pvclock structure, which is per-vcpu, which is private to each guest. The guest nevern maintains a separate long-term copy of the structure, only a transient snapshot while computing time from the tsc (that''s the current pvclock.c code).> No need to read them atomically. > > cpu1 = vgetcpu() > hver1 = pvclock[cpu1].hver > kver1 = pvclock[cpu1].kver > tsc = rdtsc > /* multipication magic with pvclock[cpu1]*/ > cpu2 = vgetcpu() > hver2 = pvclock[cpu2].hver > kver2 = pvclock[cpu2].kver > valid = cpu1 == cpu2 && hver1 == hver2 && kver1 == kver2I don''t think that''s necessary, but I can certainly live with it if it makes you happier.>>> It''s sufficient to increment a version counter on thread migration, no >>> need to do it on context switch. >>> >>> >> That''s true; switch_out is a pessimistic approximation of that. But is >> there a convenient hook to test for migration? >> > > I''d guess not but it''s probably easy to add one in the migration thread.Looking at that now. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-07 21:37 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/2009 11:19 PM, Jeremy Fitzhardinge wrote:> >> When do you copy? >> >> I''d rather have a single copy for guest and host. >> > When Xen updates the parameters normally. The interface never really > needed to share the memory between hypervisor and guest, and I think > avoiding it is a bit more robust. > > But for KVM, you already use the MSR to place the pvclock_vcpu_time_info > structure, so you could just place it in the page and use the same > memory for kernel and usermode. >Yes.>> If the hypervisor does a pvclock->version = somethingelse->version++ >> then the guest may get confused. But I understand you have a >> guest-private ->version? >> > The guest should never get confused by the version being changed by the > hypervisor. It''s already part of the ABI. Or did you mean something else? >If the guest does a RMW on the version, but the host does not (copying it from somewhere else), then the guest RMW can be lost. Looking at the code, that''s what kvm does: vcpu->hv_clock.version += 2; shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0); memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock, sizeof(vcpu->hv_clock)); so a guest-side ++version can be lost.> I''m not sure what you mean by "guest-private version"; the versions are > always guest-private: te version is part of the pvclock structure, > which is per-vcpu, which is private to each guest. The guest nevern > maintains a separate long-term copy of the structure, only a transient > snapshot while computing time from the tsc (that''s the current pvclock.c > code). >Same for kvm. I''m not worried about cross-guest corruption, just the guest and host working together to confuse the guest.>> No need to read them atomically. >> >> cpu1 = vgetcpu() >> hver1 = pvclock[cpu1].hver >> kver1 = pvclock[cpu1].kver >> tsc = rdtsc >> /* multipication magic with pvclock[cpu1]*/ >> cpu2 = vgetcpu() >> hver2 = pvclock[cpu2].hver >> kver2 = pvclock[cpu2].kver >> valid = cpu1 == cpu2&& hver1 == hver2&& kver1 == kver2 >> > I don''t think that''s necessary, but I can certainly live with it if it > makes you happier. >I think the version issue requires it. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-07 21:51 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/09 14:37, Avi Kivity wrote:> If the guest does a RMW on the version, but the host does not (copying > it from somewhere else), then the guest RMW can be lost. > > Looking at the code, that''s what kvm does: > > vcpu->hv_clock.version += 2; > > shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0); > > memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock, > sizeof(vcpu->hv_clock)); > > so a guest-side ++version can be lost.I see, yes. The Xen code explicitly reads back the guest version and increments that (I realize now that''s what you meant by guest-private version). If you were to have a second version number it would have to be separated as well to avoid being overwritten by the hypervisor. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-07 21:53 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/2009 11:51 PM, Jeremy Fitzhardinge wrote:> On 10/07/09 14:37, Avi Kivity wrote: > >> If the guest does a RMW on the version, but the host does not (copying >> it from somewhere else), then the guest RMW can be lost. >> >> Looking at the code, that''s what kvm does: >> >> vcpu->hv_clock.version += 2; >> >> shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0); >> >> memcpy(shared_kaddr + vcpu->time_offset,&vcpu->hv_clock, >> sizeof(vcpu->hv_clock)); >> >> so a guest-side ++version can be lost. >> > I see, yes. The Xen code explicitly reads back the guest version and > increments that (I realize now that''s what you meant by guest-private > version). If you were to have a second version number it would have to > be separated as well to avoid being overwritten by the hypervisor. >Yes. We have the space since a cacheline is 64 bytes (minimum) vs 32 bytes of pvclock data. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-07 22:36 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> Then they will get incorrect timing once they are live migrated.I''ve posted a proposed (OS-independent) solution for that and am (slowly) in the process of implementing it.> -----Original Message----- > From: Avi Kivity [mailto:avi@redhat.com] > Sent: Wednesday, October 07, 2009 3:08 PM > To: Dan Magenheimer > Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Xen-devel; Kurt Hackel; > the arch/x86 maintainers; Linux Kernel Mailing List; Glauber > de Oliveira > Costa; Keir Fraser; Zach Brown; Chris Mason > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > > On 10/07/2009 10:48 PM, Dan Magenheimer wrote: > >> We can support them by falling back to the kernel. I''m a > bit worried > >> about the kernel playing with the hypervisor''s version field. It''s > >> better to introduce yet a new version for the kernel, and > check both. > >> > > On Nehalem, apps that need timestamp information at a high > > frequency will likely use rdtsc/rdtscp directly. > > > > > > Then they will get incorrect timing once they are live migrated. > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-10 00:24 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/07/09 03:25, Avi Kivity wrote:> def try_pvclock_vtime(): > tsc, p0 = rdtscp() > v0 = pvclock[p0].version > tsc, p = rdtscp() > t = pvclock_time(pvclock[p], tsc) > if p != p0 or pvclock[p].version != v0: > raise Exception("Processor or timebased change under our feet") > return tThis doesn''t quite work. If we end up migrating some time after the first rdtscp, then the accesses to pvclock[] will be cross-cpu. Since we don''t made any strong SMP memory ordering guarantees on updating the structure, the snapshot isn''t guaranteed to be consistent even if we re-check the version at the end. So to use rdtscp we need to either redefine the update of pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration check. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-10 18:10 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote:> On 10/07/09 03:25, Avi Kivity wrote: > >> def try_pvclock_vtime(): >> tsc, p0 = rdtscp() >> v0 = pvclock[p0].version >> tsc, p = rdtscp() >> t = pvclock_time(pvclock[p], tsc) >> if p != p0 or pvclock[p].version != v0: >> raise Exception("Processor or timebased change under our feet") >> return t >> > This doesn''t quite work. > > If we end up migrating some time after the first rdtscp, then the > accesses to pvclock[] will be cross-cpu. Since we don''t made any strong > SMP memory ordering guarantees on updating the structure, the snapshot > isn''t guaranteed to be consistent even if we re-check the version at the > end. >We only hit this if we have a double migration, otherwise we see p != p0. Most likely all existing implementations do have a write barrier on the guest entry path, so if we add a read barrier between the two compares, that ensures we''re reading from the same cpu again.> So to use rdtscp we need to either redefine the update of > pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration > check. >I think we can update the ABI after verifying all implementations do have a write barrier. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-12 18:20 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/10/09 11:10, Avi Kivity wrote:> On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote: >> On 10/07/09 03:25, Avi Kivity wrote: >> >>> def try_pvclock_vtime(): >>> tsc, p0 = rdtscp() >>> v0 = pvclock[p0].version >>> tsc, p = rdtscp() >>> t = pvclock_time(pvclock[p], tsc) >>> if p != p0 or pvclock[p].version != v0: >>> raise Exception("Processor or timebased change under our feet") >>> return t >>> >> This doesn''t quite work. >> >> If we end up migrating some time after the first rdtscp, then the >> accesses to pvclock[] will be cross-cpu. Since we don''t made any strong >> SMP memory ordering guarantees on updating the structure, the snapshot >> isn''t guaranteed to be consistent even if we re-check the version at the >> end. >> > > We only hit this if we have a double migration, otherwise we see p != p0. > > Most likely all existing implementations do have a write barrier on > the guest entry path, so if we add a read barrier between the two > compares, that ensures we''re reading from the same cpu again.There''s a second problem: If the time_info gets updated between the first rdtscp and the first version fetch, then we won''t have a consistent tsc,time_info pair. You could check if tsc_timestamp is > tsc, but that won''t necessarily work on save/restore/migrate.>> So to use rdtscp we need to either redefine the update of >> pvclock_vcpu_time_info to be SMP-safe, or keep the additional migration >> check. >> > > I think we can update the ABI after verifying all implementations do > have a write barrier. >I suppose that works if you assume that: 1. every task->vcpu migration is associated with a hv/guest context switch, and 2. every hv/guest context switch is a write barrier I guess 2 is a given, but I can at least imagine cases where 1 might not be true. Maybe. It all seems very subtle. And I don''t really see a gain. You avoid maintaining a second version number, but at the cost of two rdtscps. In my measurements, the whole vsyscall takes around 100ns to run, and a single rdtsc takes about 30, so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in the instruction stream, and so will take at least as long; two of them will completely blow the vsyscall execution time. (By contrast, lsl only takes around 10ns, which suggests it should be used preferentially in vgetcpu anyway.) AMD CPUs have traditionally been much better than Intel at these kinds of things, so maybe rdtscp makes sense there. Or maybe Nehalem is much better than my Core2 Q6600. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-12 18:29 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/12/2009 08:20 PM, Jeremy Fitzhardinge wrote:> On 10/10/09 11:10, Avi Kivity wrote: > >> On 10/10/2009 02:24 AM, Jeremy Fitzhardinge wrote: >> >>> On 10/07/09 03:25, Avi Kivity wrote: >>> >>> >>>> def try_pvclock_vtime(): >>>> tsc, p0 = rdtscp() >>>> v0 = pvclock[p0].version >>>> tsc, p = rdtscp() >>>> t = pvclock_time(pvclock[p], tsc) >>>> if p != p0 or pvclock[p].version != v0: >>>> raise Exception("Processor or timebased change under our feet") >>>> return t >>>> > There''s a second problem: If the time_info gets updated between the > first rdtscp and the first version fetch, then we won''t have a > consistent tsc,time_info pair. You could check if tsc_timestamp is> > tsc, but that won''t necessarily work on save/restore/migrate. >Good catch. Doesn''t that invalidate rdtscp based vgettimeofday on non-virt as well (assuming p == cpu)?> I suppose that works if you assume that: > > 1. every task->vcpu migration is associated with a hv/guest context > switch, and > 2. every hv/guest context switch is a write barrier > > I guess 2 is a given, but I can at least imagine cases where 1 might not > be true. Maybe. It all seems very subtle. >What is 1 exactly? task switching to another vcpu? that doesn''t incur hypervisor involvement. vcpu moving to another cpu? That does.> And I don''t really see a gain. You avoid maintaining a second version > number, but at the cost of two rdtscps. In my measurements, the whole > vsyscall takes around 100ns to run, and a single rdtsc takes about 30, > so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in > the instruction stream, and so will take at least as long; two of them > will completely blow the vsyscall execution time. >I agree, let''s stick with the rdtscpless implementation. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-12 19:13 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/12/09 11:29, Avi Kivity wrote:> Good catch. Doesn''t that invalidate rdtscp based vgettimeofday on > non-virt as well (assuming p == cpu)?The tsc clocksource assumes the tsc is (mostly?) synced; it doesn''t use rdtscp or make any attempt at per-cpu corrections.>> I suppose that works if you assume that: >> >> 1. every task->vcpu migration is associated with a hv/guest context >> switch, and >> 2. every hv/guest context switch is a write barrier >> >> I guess 2 is a given, but I can at least imagine cases where 1 might not >> be true. Maybe. It all seems very subtle. >> > > What is 1 exactly? task switching to another vcpu? that doesn''t > incur hypervisor involvement. vcpu moving to another cpu? That does.Aie... OK. So no barrier is required for a task double migration on vcpus, because it ends up on the same pcpu and the ordering is local; if there''s a vcpu migration to a new pcpu in there too, then we always expect a barrier.>> And I don''t really see a gain. You avoid maintaining a second version >> number, but at the cost of two rdtscps. In my measurements, the whole >> vsyscall takes around 100ns to run, and a single rdtsc takes about 30, >> so 30% of total. Unlike rdtsc, rdtscp is documented as being ordered in >> the instruction stream, and so will take at least as long; two of them >> will completely blow the vsyscall execution time. >> > > I agree, let''s stick with the rdtscpless implementation.OK, I''ll use PeterZ''s hint to try and find a more complete set of migration points. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-13 06:39 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/12/2009 09:13 PM, Jeremy Fitzhardinge wrote:> On 10/12/09 11:29, Avi Kivity wrote: > >> Good catch. Doesn''t that invalidate rdtscp based vgettimeofday on >> non-virt as well (assuming p == cpu)? >> > The tsc clocksource assumes the tsc is (mostly?) synced; it doesn''t use > rdtscp or make any attempt at per-cpu corrections. >So it''s broken or disabled when that assumption is wrong? We could easily fix that now. Might even reuse the pvclock structures. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-13 20:00 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/12/09 23:39, Avi Kivity wrote:> On 10/12/2009 09:13 PM, Jeremy Fitzhardinge wrote: >> On 10/12/09 11:29, Avi Kivity wrote: >> >>> Good catch. Doesn''t that invalidate rdtscp based vgettimeofday on >>> non-virt as well (assuming p == cpu)? >>> >> The tsc clocksource assumes the tsc is (mostly?) synced; it doesn''t use >> rdtscp or make any attempt at per-cpu corrections. >> > > So it''s broken or disabled when that assumption is wrong? We could > easily fix that now. Might even reuse the pvclock structures.Well, the kernel internally makes more or less the same assumption; the vsyscall clocksource is the same as the kernel''s internal one. I think idea is that it just drops back to something like hpet if the tsc doesn''t have very simple SMP characteristics. If the kernel could characterize the per-cpu properties of the tsc more accurately, then it could use the pvclock mechanism on native. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-14 12:32 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote:> >> So it''s broken or disabled when that assumption is wrong? We could >> easily fix that now. Might even reuse the pvclock structures. >> > Well, the kernel internally makes more or less the same assumption; the > vsyscall clocksource is the same as the kernel''s internal one. I think > idea is that it just drops back to something like hpet if the tsc > doesn''t have very simple SMP characteristics. > > If the kernel could characterize the per-cpu properties of the tsc more > accurately, then it could use the pvclock mechanism on native. >It does - that''s how kvm implements pvclock on the host side. See kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-15 19:17 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/14/09 05:32, Avi Kivity wrote:> On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote: >> >>> So it''s broken or disabled when that assumption is wrong? We could >>> easily fix that now. Might even reuse the pvclock structures. >>> >> Well, the kernel internally makes more or less the same assumption; the >> vsyscall clocksource is the same as the kernel''s internal one. I think >> idea is that it just drops back to something like hpet if the tsc >> doesn''t have very simple SMP characteristics. >> >> If the kernel could characterize the per-cpu properties of the tsc more >> accurately, then it could use the pvclock mechanism on native. >> > > It does - that''s how kvm implements pvclock on the host side. See > kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c.The tsc clocksource currently seems a fair bit more picky though; it doesn''t attempt to sync tscs or work out their relative offsets or rates. Which seems a bit defeatist. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-27 17:29 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
Is there any way for an application to conclusively determine programmatically if the "fast vsyscall" pvclock is functional vs the much much slower gettimeofday/clock_gettime equivalents? If not, might it be possible to implement some (sysfs?) way to determine this, that would also be backwards compatible to existing OS''s that don''t have pvclock+vsyscall supported? Thanks, Dan> -----Original Message----- > From: Avi Kivity [mailto:avi@redhat.com] > Sent: Wednesday, October 14, 2009 6:33 AM > To: Jeremy Fitzhardinge > Cc: Dan Magenheimer; Jeremy Fitzhardinge; Kurt Hackel; the arch/x86 > maintainers; Linux Kernel Mailing List; Glauber de Oliveira Costa; > Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo Molnar > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > > On 10/14/2009 05:00 AM, Jeremy Fitzhardinge wrote: > > > >> So it''s broken or disabled when that assumption is wrong? We could > >> easily fix that now. Might even reuse the pvclock structures. > >> > > Well, the kernel internally makes more or less the same > assumption; the > > vsyscall clocksource is the same as the kernel''s internal > one. I think > > idea is that it just drops back to something like hpet if the tsc > > doesn''t have very simple SMP characteristics. > > > > If the kernel could characterize the per-cpu properties of > the tsc more > > accurately, then it could use the pvclock mechanism on native. > > > > It does - that''s how kvm implements pvclock on the host side. See > kvmclock_cpufreq_notifier() in arch/x86/kvm/x86.c. > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-27 18:20 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/27/09 10:29, Dan Magenheimer wrote:> Is there any way for an application to conclusively determine > programmatically if the "fast vsyscall" pvclock is functional > vs the much much slower gettimeofday/clock_gettime equivalents? > > If not, might it be possible to implement some (sysfs?) > way to determine this, that would also be backwards compatible > to existing OS''s that don''t have pvclock+vsyscall supported? >It would probably be simplest and most portable for the app to just measure the performance of gettimeofday and see if it meets its needs. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-28 05:52 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/27/2009 08:20 PM, Jeremy Fitzhardinge wrote:> On 10/27/09 10:29, Dan Magenheimer wrote: > >> Is there any way for an application to conclusively determine >> programmatically if the "fast vsyscall" pvclock is functional >> vs the much much slower gettimeofday/clock_gettime equivalents? >> >> If not, might it be possible to implement some (sysfs?) >> way to determine this, that would also be backwards compatible >> to existing OS''s that don''t have pvclock+vsyscall supported? >> >> > It would probably be simplest and most portable for the app to just > measure the performance of gettimeofday and see if it meets its needs. >How can you reliably measure performance in a virtualized environment? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Glauber Costa
2009-Oct-28 09:29 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On Wed, Oct 28, 2009 at 07:52:04AM +0200, Avi Kivity wrote:> On 10/27/2009 08:20 PM, Jeremy Fitzhardinge wrote: >> On 10/27/09 10:29, Dan Magenheimer wrote: >> >>> Is there any way for an application to conclusively determine >>> programmatically if the "fast vsyscall" pvclock is functional >>> vs the much much slower gettimeofday/clock_gettime equivalents? >>> >>> If not, might it be possible to implement some (sysfs?) >>> way to determine this, that would also be backwards compatible >>> to existing OS''s that don''t have pvclock+vsyscall supported? >>> >>> >> It would probably be simplest and most portable for the app to just >> measure the performance of gettimeofday and see if it meets its needs. >> > > How can you reliably measure performance in a virtualized environment?If we loop gettimeofday(), I would expect the vsyscall-based version not to show up in strace, right? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-28 09:34 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/28/2009 11:29 AM, Glauber Costa wrote:>> How can you reliably measure performance in a virtualized environment? >> > If we loop gettimeofday(), I would expect the vsyscall-based version not to show > up in strace, right? >Much better to have an API for this. Life is hacky enough already. -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-28 17:47 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/28/09 02:34, Avi Kivity wrote:> On 10/28/2009 11:29 AM, Glauber Costa wrote: >>> How can you reliably measure performance in a virtualized environment? >>> >> If we loop gettimeofday(), I would expect the vsyscall-based version >> not to show >> up in strace, right? >> > > Much better to have an API for this. Life is hacky enough already.My point is that if an app cares about property X then it should just measure property X. The fact that gettimeofday is a vsyscall is just an implementation detail that apps don''t really care about. What they care about is whether gettimeofday is fast or not. If the environment has such unstable timing that the effect can''t be measured, then it is moot whether its a vsyscall or not (but in that case its almost certainly better to use the standard API rather than trying to roll your own timesource with rdtsc). J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-29 12:13 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote:>> Much better to have an API for this. Life is hacky enough already. >> > My point is that if an app cares about property X then it should just > measure property X. The fact that gettimeofday is a vsyscall is just an > implementation detail that apps don''t really care about. What they care > about is whether gettimeofday is fast or not. >But we can not make a reliable measurement.> If the environment has such unstable timing that the effect can''t be > measured, then it is moot whether its a vsyscall or not (but in that > case its almost certainly better to use the standard API rather than > trying to roll your own timesource with rdtsc). >If you''re interested in gettimeofday() for a global monotonic counter you can fall back to atomic_fetch_and_add() which will be faster than a syscall even on large systems. Maybe we should provide a vsyscall for global monotonic counters and implement it using a atomics or tsc instead of these hacks (I''m assuming here that the gettimeofday() calls are used to implement an atomic counter - are they?) -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Chris Mason
2009-Oct-29 13:03 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On Thu, Oct 29, 2009 at 02:13:50PM +0200, Avi Kivity wrote:> On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote: > >>Much better to have an API for this. Life is hacky enough already. > >My point is that if an app cares about property X then it should just > >measure property X. The fact that gettimeofday is a vsyscall is just an > >implementation detail that apps don''t really care about. What they care > >about is whether gettimeofday is fast or not. > > But we can not make a reliable measurement.I can''t imagine how we''d decide what fast is? Please don''t make the applications guess. -chris _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-29 14:46 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Avi Kivity [mailto:avi@redhat.com] > > On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote: > >> Much better to have an API for this. Life is hacky enough already. > >> > > My point is that if an app cares about property X then it > should just > > measure property X. The fact that gettimeofday is a > vsyscall is just an > > implementation detail that apps don''t really care about. > What they care > > about is whether gettimeofday is fast or not. > > > > But we can not make a reliable measurement. > > > If the environment has such unstable timing that the effect can''t be > > measured, then it is moot whether its a vsyscall or not (but in that > > case its almost certainly better to use the standard API rather than > > trying to roll your own timesource with rdtsc). > > > > If you''re interested in gettimeofday() for a global monotonic counter > you can fall back to atomic_fetch_and_add() which will be > faster than a > syscall even on large systems. Maybe we should provide a > vsyscall for > global monotonic counters and implement it using a atomics or tsc > instead of these hacks (I''m assuming here that the > gettimeofday() calls > are used to implement an atomic counter - are they?)No, the apps I''m familiar with (a DB and a JVM) need a timestamp not a monotonic counter. The timestamps must be relatively accurate (e.g. we''ve been talking about gettimeofday generically, but these apps would use clock_gettime for nsec resolution), monotonically increasing, and work properly across a VM migration. The timestamps are taken up to a 100K/sec or more so the apps need to ensure they are using the fastest mechanism available that meets those requirements. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Oct-29 15:07 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/29/2009 04:46 PM, Dan Magenheimer wrote:> No, the apps I''m familiar with (a DB and a JVM) need a timestamp > not a monotonic counter. The timestamps must be relatively > accurate (e.g. we''ve been talking about gettimeofday generically, > but these apps would use clock_gettime for nsec resolution), > monotonically increasing, and work properly across a VM > migration. The timestamps are taken up to a 100K/sec or > more so the apps need to ensure they are using the fastest > mechanism available that meets those requirements. >Out of interest, do you know (and can you relate) why those apps need 100k/sec monotonically increasing timestamps? -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-29 15:55 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Avi Kivity [mailto:avi@redhat.com] > Sent: Thursday, October 29, 2009 9:07 AM > To: Dan Magenheimer > Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt > Hackel; the arch/x86 maintainers; Linux Kernel Mailing List; > Glauber de > Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo > Molnar > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > > On 10/29/2009 04:46 PM, Dan Magenheimer wrote: > > No, the apps I''m familiar with (a DB and a JVM) need a timestamp > > not a monotonic counter. The timestamps must be relatively > > accurate (e.g. we''ve been talking about gettimeofday generically, > > but these apps would use clock_gettime for nsec resolution), > > monotonically increasing, and work properly across a VM > > migration. The timestamps are taken up to a 100K/sec or > > more so the apps need to ensure they are using the fastest > > mechanism available that meets those requirements. > > Out of interest, do you know (and can you relate) why those apps need > 100k/sec monotonically increasing timestamps?I don''t have any public data available for this DB usage, but basically assume it is measuring transactions at a very high throughput, some of which are to a memory-resident portion of the DB. Anecdotally, I''m told the difference between non-vsyscall gettimeofday and native rdtsc (on a machine with Invariant TSC support) can affect overall DB performance by as much as 10-20%. I did find the following public link for the JVM: http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html Search for "flight recorder". This feature is intended to be enabled all the time, but with non-vsyscall gettimeofday the performance impact is unacceptably high, so they are using rdtscp instead (on those machines where it is available). With rdtscp, the performance impact is not measureable. Though the processor/server vendors have finally fixed the "unsynced TSC" problem on recent x86 platforms, thus allowing enterprise software to obtain timestamps at rdtsc performance, the problem comes back all over again with virtualization because of migration. Jeremy''s vsyscall+pvclock is a great solution if the app can ensure that it is present; if not, the apps will instead continue to use rdtsc as even emulated rdtsc is 2-3x faster than non-vsyscall gettimeofday. Does that help? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-29 16:15 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On a related note, though some topic drift, many of the problems that occur in virtualization due to migration could be better addressed if Linux had an architected interface to allow it to be signaled if a migration occurred, and if Linux could signal applications of the same. I don''t have any cycles (pun intended) to think about this right now, but if anyone else starts looking at it, I''d love to be cc''ed. Thanks, Dan> -----Original Message----- > From: Dan Magenheimer > Sent: Thursday, October 29, 2009 9:56 AM > To: Avi Kivity > Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Kurt Hackel; Glauber > Costa; the arch/x86 maintainers; Linux Kernel Mailing List; Glauber de > Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Ingo Molnar; Chris > Mason > Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > > > From: Avi Kivity [mailto:avi@redhat.com] > > Sent: Thursday, October 29, 2009 9:07 AM > > To: Dan Magenheimer > > Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt > > Hackel; the arch/x86 maintainers; Linux Kernel Mailing List; > > Glauber de > > Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris > Mason; Ingo > > Molnar > > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > > implementation > > > > > > On 10/29/2009 04:46 PM, Dan Magenheimer wrote: > > > No, the apps I''m familiar with (a DB and a JVM) need a timestamp > > > not a monotonic counter. The timestamps must be relatively > > > accurate (e.g. we''ve been talking about gettimeofday generically, > > > but these apps would use clock_gettime for nsec resolution), > > > monotonically increasing, and work properly across a VM > > > migration. The timestamps are taken up to a 100K/sec or > > > more so the apps need to ensure they are using the fastest > > > mechanism available that meets those requirements. > > > > Out of interest, do you know (and can you relate) why those > apps need > > 100k/sec monotonically increasing timestamps? > > I don''t have any public data available for this DB usage, but > basically > assume it is measuring transactions at a very high throughput, some > of which are to a memory-resident portion of the DB. Anecdotally, > I''m told the difference between non-vsyscall gettimeofday > and native rdtsc (on a machine with Invariant TSC support) can > affect overall DB performance by as much as 10-20%. > > I did find the following public link for the JVM: > > http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html Search for "flight recorder". This feature is intended to be enabled all the time, but with non-vsyscall gettimeofday the performance impact is unacceptably high, so they are using rdtscp instead (on those machines where it is available). With rdtscp, the performance impact is not measureable. Though the processor/server vendors have finally fixed the "unsynced TSC" problem on recent x86 platforms, thus allowing enterprise software to obtain timestamps at rdtsc performance, the problem comes back all over again with virtualization because of migration. Jeremy''s vsyscall+pvclock is a great solution if the app can ensure that it is present; if not, the apps will instead continue to use rdtsc as even emulated rdtsc is 2-3x faster than non-vsyscall gettimeofday. Does that help? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Oct-30 23:30 UTC
[Xen-devel] pvclock implementation in pv_ops kernel: why not __native_read_tsc()?
Hi Jeremy -- I was looking at rdtsc usages in 2.6.31... you may or may not be surprised to hear that there is exactly one usage location, at least through boot and some basic NFS copying: native_read_tsc(). Digging through the pvclock code, I see that pvclock calls native_read_tsc() rather than use the __native_read_tsc() inline function. Any reason for this? It seems an unnecessary extra function call for a routine that is used thousands of times every second. Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2009-Oct-31 01:17 UTC
[Xen-devel] Re: pvclock implementation in pv_ops kernel: why not __native_read_tsc()?
On 10/30/09 16:30, Dan Magenheimer wrote:> I was looking at rdtsc usages in 2.6.31... you may > or may not be surprised to hear that there is exactly > one usage location, at least through boot and some > basic NFS copying: native_read_tsc(). Digging through > the pvclock code, I see that pvclock calls native_read_tsc() > rather than use the __native_read_tsc() inline function. >I already changed it to __native_read_tsc() in xen.git, mostly because the vsyscall code can''t call kernel functions. But I''m not sure what the distinction is supposed to be.> Any reason for this? It seems an unnecessary extra > function call for a routine that is used thousands > of times every second. >I suppose, but the function call is trivial compared to the cost of the rdtsc itself, or even the lfence barriers surrounding it. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Nov-01 09:28 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/29/2009 06:15 PM, Dan Magenheimer wrote:> On a related note, though some topic drift, many of > the problems that occur in virtualization due to migration > could be better addressed if Linux had an architected > interface to allow it to be signaled if a migration > occurred, and if Linux could signal applications of > the same. I don''t have any cycles (pun intended) to > think about this right now, but if anyone else starts > looking at it, I''d love to be cc''ed. >IMO that''s not a good direction. The hypervisor should not depend on the guest for migration (the guest may be broken, or malicious, or being debugged, or slow). So the notification must be asynchronous, which means that it will only be delivered to applications after migration has completed. Instead of a "migration has occured, run for the hills" signal we''re better of finding out why applications want to know about this event and addressing specific needs. -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Nov-01 09:32 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 10/29/2009 05:55 PM, Dan Magenheimer wrote:>> From: Avi Kivity [mailto:avi@redhat.com] >> Sent: Thursday, October 29, 2009 9:07 AM >> To: Dan Magenheimer >> Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt >> Hackel; the arch/x86 maintainers; Linux Kernel Mailing List; >> Glauber de >> Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris Mason; Ingo >> Molnar >> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall >> implementation >> >> >> On 10/29/2009 04:46 PM, Dan Magenheimer wrote: >> >>> No, the apps I''m familiar with (a DB and a JVM) need a timestamp >>> not a monotonic counter. The timestamps must be relatively >>> accurate (e.g. we''ve been talking about gettimeofday generically, >>> but these apps would use clock_gettime for nsec resolution), >>> monotonically increasing, and work properly across a VM >>> migration. The timestamps are taken up to a 100K/sec or >>> more so the apps need to ensure they are using the fastest >>> mechanism available that meets those requirements. >>> >> Out of interest, do you know (and can you relate) why those apps need >> 100k/sec monotonically increasing timestamps? >> > I don''t have any public data available for this DB usage, but basically > assume it is measuring transactions at a very high throughput, some > of which are to a memory-resident portion of the DB. Anecdotally, > I''m told the difference between non-vsyscall gettimeofday > and native rdtsc (on a machine with Invariant TSC support) can > affect overall DB performance by as much as 10-20%. >Sorry, that doesn''t explain anything.> I did find the following public link for the JVM: > > http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html > > Search for "flight recorder". This feature is intended to > be enabled all the time, but with non-vsyscall gettimeofday > the performance impact is unacceptably high, so they are using > rdtscp instead (on those machines where it is available). With > rdtscp, the performance impact is not measureable. > > Though the processor/server vendors have finally fixed the > "unsynced TSC" problem on recent x86 platforms, thus allowing > enterprise software to obtain timestamps at rdtsc performance, > the problem comes back all over again with virtualization > because of migration. Jeremy''s vsyscall+pvclock is a great > solution if the app can ensure that it is present; if not, > the apps will instead continue to use rdtsc as even emulated > rdtsc is 2-3x faster than non-vsyscall gettimeofday. > > Does that help? >For profiling work fast timestamping is of course great, but surely there is no monotonicity requirement? I don''t think we''ll be able to provide monotonicity with vsyscall on tsc-broken hosts, so we''ll be limited to correcting the tsc frequency after migration for good-tsc hosts. -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-02 15:28 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Avi Kivity [mailto:avi@redhat.com] > > On 10/29/2009 06:15 PM, Dan Magenheimer wrote: > > On a related note, though some topic drift, many of > > the problems that occur in virtualization due to migration > > could be better addressed if Linux had an architected > > interface to allow it to be signaled if a migration > > occurred, and if Linux could signal applications of > > the same. I don''t have any cycles (pun intended) to > > think about this right now, but if anyone else starts > > looking at it, I''d love to be cc''ed. > > IMO that''s not a good direction. The hypervisor should not depend on > the guest for migration (the guest may be broken, or > malicious, or being > debugged, or slow). So the notification must be asynchronous, which > means that it will only be delivered to applications after > migration has > completed.I definitely agree that the hypervisor can''t wait for a guest to respond. You''ve likely thought through this a lot more than I have, but I was thinking that if the kernel received the notification as some form of interrupt, it could determine immediately if any running threads had registered for "SIG_MIGRATE" and deliver the signal synchronously.> Instead of a "migration has occured, run for the hills" signal we''re > better of finding out why applications want to know about > this event and > addressing specific needs.Perhaps. It certainly isn''t warranted for this one special case of timestamp handling. But I''ll bet 5-10 years from now, after we''ve handled a few special cases, we''ll wish that we would have handled it more generically. Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Nov-02 15:41 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 11/02/2009 05:28 PM, Dan Magenheimer wrote:>> From: Avi Kivity [mailto:avi@redhat.com] >> >> On 10/29/2009 06:15 PM, Dan Magenheimer wrote: >> >>> On a related note, though some topic drift, many of >>> the problems that occur in virtualization due to migration >>> could be better addressed if Linux had an architected >>> interface to allow it to be signaled if a migration >>> occurred, and if Linux could signal applications of >>> the same. I don''t have any cycles (pun intended) to >>> think about this right now, but if anyone else starts >>> looking at it, I''d love to be cc''ed. >>> >> IMO that''s not a good direction. The hypervisor should not depend on >> the guest for migration (the guest may be broken, or >> malicious, or being >> debugged, or slow). So the notification must be asynchronous, which >> means that it will only be delivered to applications after >> migration has >> completed. >> > I definitely agree that the hypervisor can''t wait for a guest > to respond. > > You''ve likely thought through this a lot more than I have, > but I was thinking that if the kernel received the notification > as some form of interrupt, it could determine immediately > if any running threads had registered for "SIG_MIGRATE" > and deliver the signal synchronously. >Interrupts cannot be delivered immediately. Exceptions can, but not all guest code is prepared to handle them. Once you start to handle the exception, migration is complete and you are late.>> Instead of a "migration has occured, run for the hills" signal we''re >> better of finding out why applications want to know about >> this event and >> addressing specific needs. >> > Perhaps. It certainly isn''t warranted for this one > special case of timestamp handling. But I''ll bet 5-10 years > from now, after we''ve handled a few special cases, we''ll > wish that we would have handled it more generically. >Or we''ll find that backwards compatibility for the generic signal is killing some optimization. -- error compiling committee.c: too many arguments to function _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-02 15:46 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> > I don''t have any public data available for this DB usage, > > Sorry, that doesn''t explain anything.Well for now just consider the DB usage as another use of profiling. But one can easily draw scenarios where a monotonic timestamp is also used to guarantee transaction ordering.> > Search for "flight recorder". This feature is intended to > > be enabled all the time, but with non-vsyscall gettimeofday > > the performance impact is unacceptably high, so they are using > > For profiling work fast timestamping is of course great, but surely > there is no monotonicity requirement?Yes and no. Monotonicity is a poor substitute for a more generic mechanism that might provide an indication that a discontinuity has occurred (forward or backward); if an app could get both the timestamp AND some kind of "continuity generation counter" (basically a much more sophisticated form of TSC_AUX that changes whenever the timestamp is coming from a different source), perhaps all problems could be solved.> I don''t think we''ll be able to provide monotonicity with vsyscall on > tsc-broken hosts, so we''ll be limited to correcting the tsc frequency > after migration for good-tsc hosts.True, though clock_gettime(CLOCK_MONOTONIC) can provide the monotonicity where it is required. Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Nov-03 05:12 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 11/02/2009 05:46 PM, Dan Magenheimer wrote:>>> I don''t have any public data available for this DB usage, >>> >> Sorry, that doesn''t explain anything. >> > Well for now just consider the DB usage as another use > of profiling. But one can easily draw scenarios where > a monotonic timestamp is also used to guarantee transaction > ordering. >In this case we should provide a facility for this. Providing a global monotonic counter may be easier than providing a monotonic clock. Hence my question.>>> Search for "flight recorder". This feature is intended to >>> be enabled all the time, but with non-vsyscall gettimeofday >>> the performance impact is unacceptably high, so they are using >>> >> For profiling work fast timestamping is of course great, but surely >> there is no monotonicity requirement? >> > Yes and no. Monotonicity is a poor substitute for a more > generic mechanism that might provide an indication that a > discontinuity has occurred (forward or backward); if an app > could get both the timestamp AND some kind of "continuity > generation counter" (basically a much more sophisticated > form of TSC_AUX that changes whenever the timestamp is > coming from a different source), perhaps all problems could be solved. >I doubt it. A discontinuity has occured, but what do we know about it? nothing.>> I don''t think we''ll be able to provide monotonicity with vsyscall on >> tsc-broken hosts, so we''ll be limited to correcting the tsc frequency >> after migration for good-tsc hosts. >> > True, though clock_gettime(CLOCK_MONOTONIC) can provide > the monotonicity where it is required. >We have that already. The question is how to implement it in a vsyscall. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-04 20:30 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Avi Kivity [mailto:avi@redhat.com] > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall > implementation > > On 11/02/2009 05:46 PM, Dan Magenheimer wrote: > >>> I don''t have any public data available for this DB usage, > >>> > >> Sorry, that doesn''t explain anything. > >> > > Well for now just consider the DB usage as another use > > of profiling. But one can easily draw scenarios where > > a monotonic timestamp is also used to guarantee transaction > > ordering. > > > > In this case we should provide a facility for this. > Providing a global > monotonic counter may be easier than providing a monotonic > clock. Hence > my question.Maybe I''m misunderstanding something, but enterprise apps can do this entirely on their own without any kernel help, correct? Or are you trying to provide it across guests, e.g. for clusters or something?> >> For profiling work fast timestamping is of course great, but surely > >> there is no monotonicity requirement? > >> > > Yes and no. Monotonicity is a poor substitute for a more > > generic mechanism that might provide an indication that a > > discontinuity has occurred (forward or backward); if an app > > could get both the timestamp AND some kind of "continuity > > generation counter" (basically a much more sophisticated > > form of TSC_AUX that changes whenever the timestamp is > > coming from a different source), perhaps all problems could > be solved. > > I doubt it. A discontinuity has occured, but what do we know > about it? nothing.Actually, I think for many/most profiling applications, just knowing a discontinuity occurred between two timestamps is very useful as that one specific measurement can be discarded. If a discontinuity is invisible, one clearly knows that a negative interval is bad, but if an interval is very small or very large, one never knows if it is due to a discontinuity or due to some other reason. This would argue for a syscall/vsyscall that can "return" two values: the "time" and a second "continuity generation" counter.> >> I don''t think we''ll be able to provide monotonicity with > vsyscall on > >> tsc-broken hosts, so we''ll be limited to correcting the > tsc frequency > >> after migration for good-tsc hosts. > >> > > True, though clock_gettime(CLOCK_MONOTONIC) can provide > > the monotonicity where it is required. > > We have that already. The question is how to implement it in > a vsyscall.Oh, I see. I missed that very crucial point. So, just to verify/clarify... There is NO WAY for a vsyscall to ensure monotonicity (presumably because the previous reading can''t be safely stored?). So speed and "correctness" are mutually exclusive? If true, yes, that''s a potentially significant problem\ though an intelligent app can layer monotonicity on top of the call I suppose. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
john stultz
2009-Nov-04 21:19 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <avi@redhat.com> wrote:> On 10/29/2009 04:46 PM, Dan Magenheimer wrote: >> >> No, the apps I''m familiar with (a DB and a JVM) need a timestamp >> not a monotonic counter. The timestamps must be relatively >> accurate (e.g. we''ve been talking about gettimeofday generically, >> but these apps would use clock_gettime for nsec resolution), >> monotonically increasing, and work properly across a VM >> migration. The timestamps are taken up to a 100K/sec or >> more so the apps need to ensure they are using the fastest >> mechanism available that meets those requirements. >> > > Out of interest, do you know (and can you relate) why those apps need > 100k/sec monotonically increasing timestamps?This is sort of tangential, but depending on the need, this might be of interest: Recently I''ve added a new clock_id, CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which return a HZ granular timestamp (same granularity as filesystem timestamps). Its very fast to access, since there''s no hardware to touch, and is accessible via vsyscall. The idea being, if your hitting clock_gettime 100k/sec but you really don''t have the need for nsec granular timestamps, it might provide a really nice performance boost. Here''s the commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da15cfdae03351c689736f8d142618592e3cebc3 thanks -john _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-04 21:28 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: john stultz [mailto:johnstul@us.ibm.com] > On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <avi@redhat.com> wrote: > > > > Out of interest, do you know (and can you relate) why those > apps need > > 100k/sec monotonically increasing timestamps? > > This is sort of tangential, but depending on the need, this might be > of interest: Recently I''ve added a new clock_id, > CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which > return a HZ granular timestamp (same granularity as filesystem > timestamps). Its very fast to access, since there''s no hardware to > touch, and is accessible via vsyscall. > > The idea being, if your hitting clock_gettime 100k/sec but you really > don''t have the need for nsec granular timestamps, it might provide a > really nice performance boost. > > Here''s the commit:Hi John -- Yes, possibly of interest. But does it work with CONFIG_NO_HZ? (I''m expecting that over time NO_HZ will become widespread for VM OS''s, though interested in if you agree.) Also very interested in your thoughts about a variation that returns something similar to a TSC_AUX to notify caller that the underlying reference clock has/may have changed. Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
john stultz
2009-Nov-05 00:02 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On Wed, 2009-11-04 at 13:28 -0800, Dan Magenheimer wrote:> > From: john stultz [mailto:johnstul@us.ibm.com] > > On Thu, Oct 29, 2009 at 7:07 AM, Avi Kivity <avi@redhat.com> wrote: > > > > > > Out of interest, do you know (and can you relate) why those > > apps need > > > 100k/sec monotonically increasing timestamps? > > > > This is sort of tangential, but depending on the need, this might be > > of interest: Recently I''ve added a new clock_id, > > CLOCK_MONOTONIC_COARSE (as well as CLOCK_REALTIME_COARSE), which > > return a HZ granular timestamp (same granularity as filesystem > > timestamps). Its very fast to access, since there''s no hardware to > > touch, and is accessible via vsyscall. > > > > The idea being, if your hitting clock_gettime 100k/sec but you really > > don''t have the need for nsec granular timestamps, it might provide a > > really nice performance boost. > > > > Here''s the commit: > > Hi John -- > > Yes, possibly of interest. But does it work with CONFIG_NO_HZ? > (I''m expecting that over time NO_HZ will become widespread > for VM OS''s, though interested in if you agree.)It should work, with CONFIG_NO_HZ, as soon as we come out of a long idle (likely due to a timer tick), the timekeeping code will accumulate all the skipped ticks. If we ever get to non-idle NOHZ, we''ll need some extra work here (probably lazy accumulation done conditionally in the read path), but that''s also true for filesystem timestamps.> Also very interested in your thoughts about a variation > that returns something similar to a TSC_AUX to notify > caller that the underlying reference clock has/may have > changed.I haven''t been following that closely. Personally, experience makes me skeptical of workarounds for unsynced TSCs. But I''m sure there''s sharper folks out there that might make it work. The kernel just requires that it *really really* works, and not "mostly" works. :) thanks -john _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-05 00:45 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> > Yes, possibly of interest. But does it work with CONFIG_NO_HZ? > > (I''m expecting that over time NO_HZ will become widespread > > for VM OS''s, though interested in if you agree.) > > It should work, with CONFIG_NO_HZ, as soon as we come out of > a long idle > (likely due to a timer tick), the timekeeping code will accumulate all > the skipped ticks. > > If we ever get to non-idle NOHZ, we''ll need some extra work here > (probably lazy accumulation done conditionally in the read path), but > that''s also true for filesystem timestamps.OK, sounds good.> > Also very interested in your thoughts about a variation > > that returns something similar to a TSC_AUX to notify > > caller that the underlying reference clock has/may have > > changed. > > I haven''t been following that closely. Personally, experience makes me > skeptical of workarounds for unsynced TSCs. But I''m sure > there''s sharper > folks out there that might make it work. The kernel just requires that > it *really really* works, and not "mostly" works. :)This is less a workaround for unsynced TSCs than it is for VM migration (and maybe also time where a VM is out-of-context or moved to a different pcpu) though it could probably be made to work on unsynced TSC boxes also. Basically an application needing hi-res profiling info would do: nsec1 = clock_gettime2(MONOTONIC,&aux1); (time passes) nsec2 = clock_gettime2(MONOTONIC,&aux2); if (aux1 != aux2) discard_measurement(); else use_measurement(nsec2-nsec1); and system software (hypervisor or kernel or both) is responsible for ensuring aux value monotonically increases whenever a different crystal is used. Without something like this as a vsyscall, apps will just use rdtscp (which must be emulated to work properly across a migration). _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Avi Kivity
2009-Nov-05 06:47 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 11/04/2009 10:30 PM, Dan Magenheimer wrote:>> >> In this case we should provide a facility for this. >> Providing a global >> monotonic counter may be easier than providing a monotonic >> clock. Hence >> my question. >> > Maybe I''m misunderstanding something, but enterprise > apps can do this entirely on their own without any > kernel help, correct? >Within a process, yes. Across processes, not without writable shared memory. That''s why I''m trying to understand what the actual requirements are. Real monotonic, accurate, high resolution, low cost time sources are hard to come by.>> I doubt it. A discontinuity has occured, but what do we know >> about it? nothing. >> > Actually, I think for many/most profiling applications, > just knowing a discontinuity occurred between two > timestamps is very useful as that one specific measurement > can be discarded. If a discontinuity is invisible, > one clearly knows that a negative interval is bad, > but if an interval is very small or very large, > one never knows if it is due to a discontinuity or > due to some other reason. > > This would argue for a syscall/vsyscall that can > "return" two values: the "time" and a second > "continuity generation" counter. > >I doubt it. You should expect discontinuities in user space due to being swapped out, scheduled out, migrated to a different cpu, or your laptop lid being closed. There are no guarantees to a userspace application. Even the kernel can expect discontinuities due to SMIs. So an explicit notification about one type of discontinuity adds nothing.>>> True, though clock_gettime(CLOCK_MONOTONIC) can provide >>> the monotonicity where it is required. >>> >> We have that already. The question is how to implement it in >> a vsyscall. >> > Oh, I see. I missed that very crucial point. > > So, just to verify/clarify... There is NO WAY for > a vsyscall to ensure monotonicity (presumably because > the previous reading can''t be safely stored?). So > speed and "correctness" are mutually exclusive? >Yes.> If true, yes, that''s a potentially significant problem\ > though an intelligent app can layer monotonicity > on top of the call I suppose. >Unless it''s a multi-process app with limited trust. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Nov-05 14:52 UTC
RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
> From: Avi Kivity [mailto:avi@redhat.com] > > Within a process, yes. Across processes, not without writable shared > memory. > > That''s why I''m trying to understand what the actual > requirements are. > Real monotonic, accurate, high resolution, low cost time sources are > hard to come by.Hmmm... this has significant implications for the rdtsc emulation discussion on xen-devel. Since that''s not a Linux question, I''ll start another thread on xen-devel with a shorter cc list.> > Actually, I think for many/most profiling applications, > > just knowing a discontinuity occurred between two > > timestamps is very useful as that one specific measurement > > can be discarded. If a discontinuity is invisible, > > one clearly knows that a negative interval is bad, > > but if an interval is very small or very large, > > one never knows if it is due to a discontinuity or > > due to some other reason. > > > > This would argue for a syscall/vsyscall that can > > "return" two values: the "time" and a second > > "continuity generation" counter. > > I doubt it. You should expect discontinuities in user space due to > being swapped out, scheduled out, migrated to a different > cpu, or your > laptop lid being closed. There are no guarantees to a userspace > application. Even the kernel can expect discontinuities due > to SMIs. > So an explicit notification about one type of discontinuity > adds nothing.Good point. I''m interested in enterprise apps that have more control over the machine (and rarely suffer from laptop lid closures :-) and would intend for all discontinuities visible to a hypervisor or kernel to increment "AUX", but bare-metal- kernel-invisible discontinuities such as SMI do throw a wrench in the works. Well, all this discussion has convince me that my original proposals do make sense for enterprise apps to be virtualization-aware and use rdtsc/p directly for timestamping needs rather than OS APIs (with the hypervisor deciding whether or not to emulate rdtsc/p based on the underlying physical machine and whether or not migration is enabled or has occurred). _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2009-Nov-05 15:07 UTC
Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall implementation
On 05/11/2009 14:52, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:> Well, all this discussion has convince me that > my original proposals do make senseYou surprise me, Dan. ;-) -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel