Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() implementation, which extended the clockid switch case and added yet another slightly different copy of the same code. Especially the extended switch case is problematic as the compiler tends to generate a jump table which then requires to use retpolines. If jump tables are disabled it adds yet another conditional to the existing maze. This series takes a different approach by consolidating the almost identical functions into one implementation for high resolution clocks and one for the coarse grained clock ids by storing the base data for each clock id in an array which is indexed by the clock id. This completely eliminates the switch case and allows further simplifications of the code base, which at the end all together gain a few cycles performance or at least stay on par with todays code. The resulting performance depends heavily on the micro architecture and the compiler. Thanks, tglx 8<------------------- arch/x86/Kconfig | 1 arch/x86/entry/vdso/vclock_gettime.c | 199 ++++++++------------------------ arch/x86/entry/vsyscall/vsyscall_gtod.c | 55 ++++---- arch/x86/include/asm/vgtod.h | 46 ++++--- arch/x86/kernel/time.c | 22 +++ include/linux/clocksource.h | 5 kernel/time/Kconfig | 4 kernel/time/clocksource.c | 2 8 files changed, 144 insertions(+), 190 deletions(-)
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 01/11] clocksource: Provide clocksource_arch_init()
Architectures have extra archdata in the clocksource, e.g. for VDSO support. There are no sanity checks or general initializations for this available. Add support for that. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- include/linux/clocksource.h | 5 +++++ kernel/time/Kconfig | 4 ++++ kernel/time/clocksource.c | 2 ++ 3 files changed, 11 insertions(+) --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -241,6 +241,11 @@ static inline void __clocksource_update_ __clocksource_update_freq_scale(cs, 1000, khz); } +#ifdef CONFIG_ARCH_CLOCKSOURCE_INIT +extern void clocksource_arch_init(struct clocksource *cs); +#else +static inline void clocksource_arch_init(struct clocksource *cs) { } +#endif extern int timekeeping_notify(struct clocksource *clock); --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -12,6 +12,10 @@ config CLOCKSOURCE_WATCHDOG config ARCH_CLOCKSOURCE_DATA bool +# Architecture has extra clocksource init called from registration +config ARCH_CLOCKSOURCE_INIT + bool + # Clocksources require validation of the clocksource against the last # cycle update - x86/TSC misfeature config CLOCKSOURCE_VALIDATE_LAST_CYCLE --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -937,6 +937,8 @@ int __clocksource_register_scale(struct { unsigned long flags; + clocksource_arch_init(cs); + /* Initialize mult/shift and max_idle_ns */ __clocksource_update_freq_scale(cs, scale, freq);
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 02/11] x86/time: Implement clocksource_arch_init()
Runtime validate the VCLOCK_MODE in clocksource::archdata and disable VCLOCK if invalid, which disables the VDSO but keeps the system running. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/Kconfig | 1 + arch/x86/kernel/time.c | 16 ++++++++++++++++ 2 files changed, 17 insertions(+) --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -48,6 +48,7 @@ config X86 select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI select ANON_INODES select ARCH_CLOCKSOURCE_DATA + select ARCH_CLOCKSOURCE_INIT select ARCH_DISCARD_MEMBLOCK select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_DEBUG_VIRTUAL --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -10,6 +10,7 @@ * */ +#include <linux/clocksource.h> #include <linux/clockchips.h> #include <linux/interrupt.h> #include <linux/irq.h> @@ -105,3 +106,18 @@ void __init time_init(void) { late_time_init = x86_late_time_init; } + +/* + * Sanity check the vdso related archdata content. + */ +void clocksource_arch_init(struct clocksource *cs) +{ + if (cs->archdata.vclock_mode == VCLOCK_NONE) + return; + + if (cs->archdata.vclock_mode >= VCLOCK_MAX) { + pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n", + cs->name, cs->archdata.vclock_mode); + cs->archdata.vclock_mode = VCLOCK_NONE; + } +}
All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is no point in masking with all FF. Get rid of it and enforce the mask in the sanity checker. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 2 +- arch/x86/kernel/time.c | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m #endif else return 0; - v = (cycles - gtod->cycle_last) & gtod->mask; + v = cycles - gtod->cycle_last; return v * gtod->mult; } --- a/arch/x86/kernel/time.c +++ b/arch/x86/kernel/time.c @@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks cs->name, cs->archdata.vclock_mode); cs->archdata.vclock_mode = VCLOCK_NONE; } + + if (cs->mask != CLOCKSOURCE_MASK(64)) { + pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n", + cs->name, cs->mask); + cs->archdata.vclock_mode = VCLOCK_NONE; + } }
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq
The sequence count in vgtod_data is unsigned int, but the call sites use unsigned long, which is a pointless exercise. Fix the call sites and replace 'unsigned' with unsinged 'int' while at it. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 8 ++++---- arch/x86/include/asm/vgtod.h | 10 +++++----- 2 files changed, 9 insertions(+), 9 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */ notrace static int __always_inline do_realtime(struct timespec *ts) { - unsigned long seq; + unsigned int seq; u64 ns; int mode; @@ -227,7 +227,7 @@ notrace static int __always_inline do_re notrace static int __always_inline do_monotonic(struct timespec *ts) { - unsigned long seq; + unsigned int seq; u64 ns; int mode; @@ -248,7 +248,7 @@ notrace static int __always_inline do_mo notrace static void do_realtime_coarse(struct timespec *ts) { - unsigned long seq; + unsigned int seq; do { seq = gtod_read_begin(gtod); ts->tv_sec = gtod->wall_time_coarse_sec; @@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s notrace static void do_monotonic_coarse(struct timespec *ts) { - unsigned long seq; + unsigned int seq; do { seq = gtod_read_begin(gtod); ts->tv_sec = gtod->monotonic_time_coarse_sec; --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t; * so be carefull by modifying this structure. */ struct vsyscall_gtod_data { - unsigned seq; + unsigned int seq; - int vclock_mode; + int vclock_mode; u64 cycle_last; u64 mask; u32 mult; @@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v return READ_ONCE(vclocks_used) & (1 << vclock); } -static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s) +static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s) { - unsigned ret; + unsigned int ret; repeat: ret = READ_ONCE(s->seq); @@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c } static inline int gtod_read_retry(const struct vsyscall_gtod_data *s, - unsigned start) + unsigned int start) { smp_rmb(); return unlikely(s->seq != start);
It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This results either in indirect calls due to the larger switch case, which then requires retpolines or when the compiler is forced to avoid jump tables it results in even more conditionals. To avoid both variants which are bad for performance the high resolution functions and the coarse grained functions will be collapsed into one for each. That requires to store the clock specific base time in an array. Introcude struct vgtod_ts for storage and convert the data store, the update function and the individual clock functions over to use it. The new storage does not longer use gtod_long_t for seconds depending on 32 or 64 bit compile because this needs to be the full 64bit value even for 32bit when a Y2038 function is added. No point in keeping the distinction alive in the internal representation. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 24 +++++++++------ arch/x86/entry/vsyscall/vsyscall_gtod.c | 51 ++++++++++++++++---------------- arch/x86/include/asm/vgtod.h | 36 ++++++++++++---------- 3 files changed, 61 insertions(+), 50 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */ notrace static int __always_inline do_realtime(struct timespec *ts) { + struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME]; unsigned int seq; u64 ns; int mode; @@ -213,8 +214,8 @@ notrace static int __always_inline do_re do { seq = gtod_read_begin(gtod); mode = gtod->vclock_mode; - ts->tv_sec = gtod->wall_time_sec; - ns = gtod->wall_time_snsec; + ts->tv_sec = base->sec; + ns = base->nsec; ns += vgetsns(&mode); ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq))); @@ -227,6 +228,7 @@ notrace static int __always_inline do_re notrace static int __always_inline do_monotonic(struct timespec *ts) { + struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC]; unsigned int seq; u64 ns; int mode; @@ -234,8 +236,8 @@ notrace static int __always_inline do_mo do { seq = gtod_read_begin(gtod); mode = gtod->vclock_mode; - ts->tv_sec = gtod->monotonic_time_sec; - ns = gtod->monotonic_time_snsec; + ts->tv_sec = base->sec; + ns = base->nsec; ns += vgetsns(&mode); ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq))); @@ -248,21 +250,25 @@ notrace static int __always_inline do_mo notrace static void do_realtime_coarse(struct timespec *ts) { + struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME_COARSE]; unsigned int seq; + do { seq = gtod_read_begin(gtod); - ts->tv_sec = gtod->wall_time_coarse_sec; - ts->tv_nsec = gtod->wall_time_coarse_nsec; + ts->tv_sec = base->sec; + ts->tv_nsec = base->nsec; } while (unlikely(gtod_read_retry(gtod, seq))); } notrace static void do_monotonic_coarse(struct timespec *ts) { + struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC_COARSE]; unsigned int seq; + do { seq = gtod_read_begin(gtod); - ts->tv_sec = gtod->monotonic_time_coarse_sec; - ts->tv_nsec = gtod->monotonic_time_coarse_nsec; + ts->tv_sec = base->sec; + ts->tv_nsec = base->nsec; } while (unlikely(gtod_read_retry(gtod, seq))); } @@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc notrace time_t __vdso_time(time_t *t) { /* This is atomic on x86 so we don't need any locks. */ - time_t result = READ_ONCE(gtod->wall_time_sec); + time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec); if (t) *t = result; --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper * { int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data; + struct vgtod_ts *base; + u64 nsec; /* Mark the new vclock used. */ BUILD_BUG_ON(VCLOCK_MAX >= 32); @@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper * vdata->mult = tk->tkr_mono.mult; vdata->shift = tk->tkr_mono.shift; - vdata->wall_time_sec = tk->xtime_sec; - vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; - - vdata->monotonic_time_sec = tk->xtime_sec - + tk->wall_to_monotonic.tv_sec; - vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec - + ((u64)tk->wall_to_monotonic.tv_nsec - << tk->tkr_mono.shift); - while (vdata->monotonic_time_snsec >- (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { - vdata->monotonic_time_snsec -- ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift; - vdata->monotonic_time_sec++; + base = &vdata->basetime[CLOCK_REALTIME]; + base->sec = tk->xtime_sec; + base->nsec = tk->tkr_mono.xtime_nsec; + + base = &vdata->basetime[CLOCK_MONOTONIC]; + base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec; + nsec = tk->tkr_mono.xtime_nsec; + nsec += ((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift); + while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { + nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift; + base->sec++; } + base->nsec = nsec; - vdata->wall_time_coarse_sec = tk->xtime_sec; - vdata->wall_time_coarse_nsec = (long)(tk->tkr_mono.xtime_nsec >> - tk->tkr_mono.shift); - - vdata->monotonic_time_coarse_sec - vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec; - vdata->monotonic_time_coarse_nsec - vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec; - - while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) { - vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC; - vdata->monotonic_time_coarse_sec++; + base = &vdata->basetime[CLOCK_REALTIME_COARSE]; + base->sec = tk->xtime_sec; + base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + + base = &vdata->basetime[CLOCK_MONOTONIC_COARSE]; + base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec; + nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + nsec += tk->wall_to_monotonic.tv_nsec; + while (nsec >= NSEC_PER_SEC) { + nsec -= NSEC_PER_SEC; + base->sec++; } + base->nsec = nsec; gtod_write_end(vdata); } --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -5,33 +5,37 @@ #include <linux/compiler.h> #include <linux/clocksource.h> +#include <uapi/linux/time.h> + #ifdef BUILD_VDSO32_64 typedef u64 gtod_long_t; #else typedef unsigned long gtod_long_t; #endif + +struct vgtod_ts { + u64 sec; + u64 nsec; +}; + +#define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1) +#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC)) +#define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE)) + /* * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time * so be carefull by modifying this structure. */ struct vsyscall_gtod_data { - unsigned int seq; + unsigned int seq; + + int vclock_mode; + u64 cycle_last; + u64 mask; + u32 mult; + u32 shift; - int vclock_mode; - u64 cycle_last; - u64 mask; - u32 mult; - u32 shift; - - /* open coded 'struct timespec' */ - u64 wall_time_snsec; - gtod_long_t wall_time_sec; - gtod_long_t monotonic_time_sec; - u64 monotonic_time_snsec; - gtod_long_t wall_time_coarse_sec; - gtod_long_t wall_time_coarse_nsec; - gtod_long_t monotonic_time_coarse_sec; - gtod_long_t monotonic_time_coarse_nsec; + struct vgtod_ts basetime[VGTOD_BASES]; int tz_minuteswest; int tz_dsttime;
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 06/11] x86/vdso: Collapse high resolution functions
do_realtime() and do_monotonic() are now the same except for the storage array index. Hand the index in as an argument and collapse the functions. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 35 +++++++---------------------------- 1 file changed, 7 insertions(+), 28 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } -/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */ -notrace static int __always_inline do_realtime(struct timespec *ts) +notrace static int do_hres(clockid_t clk, struct timespec *ts) { - struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME]; + struct vgtod_ts *base = >od->basetime[clk]; unsigned int seq; - u64 ns; int mode; - - do { - seq = gtod_read_begin(gtod); - mode = gtod->vclock_mode; - ts->tv_sec = base->sec; - ns = base->nsec; - ns += vgetsns(&mode); - ns >>= gtod->shift; - } while (unlikely(gtod_read_retry(gtod, seq))); - - ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); - ts->tv_nsec = ns; - - return mode; -} - -notrace static int __always_inline do_monotonic(struct timespec *ts) -{ - struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC]; - unsigned int seq; u64 ns; - int mode; do { seq = gtod_read_begin(gtod); @@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid { switch (clock) { case CLOCK_REALTIME: - if (do_realtime(ts) == VCLOCK_NONE) + if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE) goto fallback; break; case CLOCK_MONOTONIC: - if (do_monotonic(ts) == VCLOCK_NONE) + if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE) goto fallback; break; case CLOCK_REALTIME_COARSE: @@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { if (likely(tv != NULL)) { - if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE)) + struct timespec *ts = (struct timespec *) tv; + + if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)) return vdso_fallback_gtod(tv, tz); tv->tv_usec /= 1000; }
do_realtime_coarse() and do_monotonic_coarse() are now the same except for the storage array index. Hand the index in as an argument and collapse the functions. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 20 ++++---------------- 1 file changed, 4 insertions(+), 16 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk return mode; } -notrace static void do_realtime_coarse(struct timespec *ts) +notrace static void do_coarse(clockid_t clk, struct timespec *ts) { - struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME_COARSE]; - unsigned int seq; - - do { - seq = gtod_read_begin(gtod); - ts->tv_sec = base->sec; - ts->tv_nsec = base->nsec; - } while (unlikely(gtod_read_retry(gtod, seq))); -} - -notrace static void do_monotonic_coarse(struct timespec *ts) -{ - struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC_COARSE]; + struct vgtod_ts *base = >od->basetime[clk]; unsigned int seq; do { @@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid goto fallback; break; case CLOCK_REALTIME_COARSE: - do_realtime_coarse(ts); + do_coarse(CLOCK_REALTIME_COARSE, ts); break; case CLOCK_MONOTONIC_COARSE: - do_monotonic_coarse(ts); + do_coarse(CLOCK_MONOTONIC_COARSE, ts); break; default: goto fallback;
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 08/11] x86/vdso: Replace the clockid switch case
Now that the time getter functions use the clockid as index into the storage array for the base time access, the switch case can be replaced. - Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first and call the fallback function right away. - After establishing that clockid is < MAX_CLOCKS, convert the clockid to a bitmask - Check for the supported high resolution and coarse functions by anding the bitmask of supported clocks and check whether a bit is set. This completely avoids jump tables, reduces the number of conditionals and makes the VDSO extensible for other clock ids. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 38 ++++++++++++++++------------------- 1 file changed, 18 insertions(+), 20 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) { - switch (clock) { - case CLOCK_REALTIME: - if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE) - goto fallback; - break; - case CLOCK_MONOTONIC: - if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE) - goto fallback; - break; - case CLOCK_REALTIME_COARSE: - do_coarse(CLOCK_REALTIME_COARSE, ts); - break; - case CLOCK_MONOTONIC_COARSE: - do_coarse(CLOCK_MONOTONIC_COARSE, ts); - break; - default: - goto fallback; - } + unsigned int msk; + + /* Sort out negative (CPU/FD) and invalid clocks */ + if (unlikely((unsigned int) clock >= MAX_CLOCKS)) + return vdso_fallback_gettime(clock, ts); - return 0; -fallback: + /* + * Convert the clockid to a bitmask and use it to check which + * clocks are handled in the VDSO directly. + */ + msk = 1U << clock; + if (likely(msk & VGTOD_HRES)) { + if (do_hres(clock, ts) != VCLOCK_NONE) + return 0; + } else if (msk & VGTOD_COARSE) { + do_coarse(clock, ts); + return 0; + } return vdso_fallback_gettime(clock, ts); } + int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime")));
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 09/11] x86/vdso: Simplify the invalid vclock case
The code flow for the vclocks is convoluted as it requires the vclocks which can be invalidated separately from the vsyscall_gtod_data sequence to store the fact in a separate variable. That's inefficient. Restructure the code so the vclock readout returns cycles and the conversion to nanoseconds is handled at the call site. If the clock gets invalidated or vclock is already VCLOCK_NONE, return U64_MAX as the cycle value, which is invalid for all clocks and leave the sequence loop immediately in that case by calling the fallback function directly. This allows to remove the gettimeofday fallback as it now uses the clock_gettime() fallback and does the nanoseconds to microseconds conversion in the same way as it does when the vclock is functional. It does not make a difference whether the division by 1000 happens in the kernel fallback or in userspace. Generates way better code and gains a few cycles back. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 81 +++++++++-------------------------- 1 file changed, 21 insertions(+), 60 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim return ret; } -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) -{ - long ret; - - asm("syscall" : "=a" (ret) : - "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory"); - return ret; -} - - #else notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) @@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim return ret; } -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) -{ - long ret; - - asm( - "mov %%ebx, %%edx \n" - "mov %2, %%ebx \n" - "call __kernel_vsyscall \n" - "mov %%edx, %%ebx \n" - : "=a" (ret) - : "0" (__NR_gettimeofday), "g" (tv), "c" (tz) - : "memory", "edx"); - return ret; -} - #endif #ifdef CONFIG_PARAVIRT_CLOCK @@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys return (const struct pvclock_vsyscall_time_info *)&pvclock_page; } -static notrace u64 vread_pvclock(int *mode) +static notrace u64 vread_pvclock(void) { const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti; u64 ret; @@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo do { version = pvclock_read_begin(pvti); - if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { - *mode = VCLOCK_NONE; - return 0; - } + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) + return U64_MAX; ret = __pvclock_read_cycles(pvti, rdtsc_ordered()); } while (pvclock_read_retry(pvti, version)); @@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo } #endif #ifdef CONFIG_HYPERV_TSCPAGE -static notrace u64 vread_hvclock(int *mode) +static notrace u64 vread_hvclock(void) { const struct ms_hyperv_tsc_page *tsc_pg (const struct ms_hyperv_tsc_page *)&hvclock_page; - u64 current_tick = hv_read_tsc_page(tsc_pg); - - if (current_tick != U64_MAX) - return current_tick; - *mode = VCLOCK_NONE; - return 0; + return hv_read_tsc_page(tsc_pg); } #endif @@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void) return last; } -notrace static inline u64 vgetsns(int *mode) +notrace static inline u64 vgetcyc(int mode) { - u64 v; - cycles_t cycles; - - if (gtod->vclock_mode == VCLOCK_TSC) - cycles = vread_tsc(); + if (mode == VCLOCK_TSC) + return vread_tsc(); #ifdef CONFIG_PARAVIRT_CLOCK - else if (gtod->vclock_mode == VCLOCK_PVCLOCK) - cycles = vread_pvclock(mode); + else if (mode == VCLOCK_PVCLOCK) + return vread_pvclock(); #endif #ifdef CONFIG_HYPERV_TSCPAGE - else if (gtod->vclock_mode == VCLOCK_HVCLOCK) - cycles = vread_hvclock(mode); + else if (mode == VCLOCK_HVCLOCK) + return vread_hvclock(); #endif - else - return 0; - v = cycles - gtod->cycle_last; - return v * gtod->mult; + return U64_MAX; } notrace static int do_hres(clockid_t clk, struct timespec *ts) { struct vgtod_ts *base = >od->basetime[clk]; unsigned int seq; - int mode; - u64 ns; + u64 cycles, ns; do { seq = gtod_read_begin(gtod); - mode = gtod->vclock_mode; ts->tv_sec = base->sec; ns = base->nsec; - ns += vgetsns(&mode); + cycles = vgetcyc(gtod->vclock_mode); + if (unlikely((s64)cycles < 0)) + return vdso_fallback_gettime(clk, ts); + ns += (cycles - gtod->cycle_last) * gtod->mult; ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq))); ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); ts->tv_nsec = ns; - return mode; + return 0; } notrace static void do_coarse(clockid_t clk, struct timespec *ts) @@ -251,8 +214,7 @@ notrace int __vdso_clock_gettime(clockid */ msk = 1U << clock; if (likely(msk & VGTOD_HRES)) { - if (do_hres(clock, ts) != VCLOCK_NONE) - return 0; + return do_hres(clock, ts); } else if (msk & VGTOD_COARSE) { do_coarse(clock, ts); return 0; @@ -268,8 +230,7 @@ notrace int __vdso_gettimeofday(struct t if (likely(tv != NULL)) { struct timespec *ts = (struct timespec *) tv; - if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)) - return vdso_fallback_gtod(tv, tz); + do_hres(CLOCK_REALTIME, ts); tv->tv_usec /= 1000; } if (unlikely(tz != NULL)) {
Thomas Gleixner
2018-Sep-14 12:50 UTC
[patch 10/11] x86/vdso: Move cycle_last handling into the caller
Dereferencing gtod->cycle_last all over the place and foing the cycles < last comparison in the vclock read functions generates horrible code. Doing it at the call site is much better and gains a few cycles both for TSC and pvclock. Caveat: This adds the comparison to the hyperv vclock as well, but I have no way to test that. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 39 ++++++----------------------------- 1 file changed, 7 insertions(+), 32 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys static notrace u64 vread_pvclock(void) { const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti; - u64 ret; - u64 last; u32 version; + u64 ret; /* * Note: The kernel and hypervisor must guarantee that cpu ID @@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void) ret = __pvclock_read_cycles(pvti, rdtsc_ordered()); } while (pvclock_read_retry(pvti, version)); - /* refer to vread_tsc() comment for rationale */ - last = gtod->cycle_last; - - if (likely(ret >= last)) - return ret; - - return last; + return ret; } #endif #ifdef CONFIG_HYPERV_TSCPAGE @@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void) } #endif -notrace static u64 vread_tsc(void) -{ - u64 ret = (u64)rdtsc_ordered(); - u64 last = gtod->cycle_last; - - if (likely(ret >= last)) - return ret; - - /* - * GCC likes to generate cmov here, but this branch is extremely - * predictable (it's just a function of time and the likely is - * very likely) and there's a data dependence, so force GCC - * to generate a branch instead. I don't barrier() because - * we don't actually need a barrier, and if this function - * ever gets inlined it will generate worse code. - */ - asm volatile (""); - return last; -} - notrace static inline u64 vgetcyc(int mode) { if (mode == VCLOCK_TSC) - return vread_tsc(); + return (u64)rdtsc_ordered(); #ifdef CONFIG_PARAVIRT_CLOCK else if (mode == VCLOCK_PVCLOCK) return vread_pvclock(); @@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo notrace static int do_hres(clockid_t clk, struct timespec *ts) { struct vgtod_ts *base = >od->basetime[clk]; + u64 cycles, last, ns; unsigned int seq; - u64 cycles, ns; do { seq = gtod_read_begin(gtod); ts->tv_sec = base->sec; ns = base->nsec; + last = gtod->cycle_last; cycles = vgetcyc(gtod->vclock_mode); if (unlikely((s64)cycles < 0)) return vdso_fallback_gettime(clk, ts); - ns += (cycles - gtod->cycle_last) * gtod->mult; + if (cycles > last) + ns += (cycles - last) * gtod->mult; ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq)));
With the storage array in place it's now trivial to support CLOCK_TAI in the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use of the fact that: - CLOCK ids are set in stone - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so the array slot 3 is unused - CLOCK_TAI is id 11 which results in 3 when masked with 0x3 Add the mask to the basetime array lookup and set up the CLOCK_TAI base time in update_vsyscall(). The performance impact of the mask operation is within the noise. Signed-off-by: Thomas Gleixner <tglx at linutronix.de> --- arch/x86/entry/vdso/vclock_gettime.c | 2 +- arch/x86/entry/vsyscall/vsyscall_gtod.c | 4 ++++ arch/x86/include/asm/vgtod.h | 6 +++++- 3 files changed, 10 insertions(+), 2 deletions(-) --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo notrace static int do_hres(clockid_t clk, struct timespec *ts) { - struct vgtod_ts *base = >od->basetime[clk]; + struct vgtod_ts *base = >od->basetime[clk & VGTOD_HRES_MASK]; u64 cycles, last, ns; unsigned int seq; --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper * base->sec = tk->xtime_sec; base->nsec = tk->tkr_mono.xtime_nsec; + base = &vdata->basetime[VGTOD_TAI]; + base->sec = tk->xtime_sec + (s64)tk->tai_offset; + base->nsec = tk->tkr_mono.xtime_nsec; + base = &vdata->basetime[CLOCK_MONOTONIC]; base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec; nsec = tk->tkr_mono.xtime_nsec; --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -19,9 +19,13 @@ struct vgtod_ts { }; #define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1) -#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC)) +#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI)) #define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE)) +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */ +#define VGTOD_HRES_MASK 0x3 +#define VGTOD_TAI (CLOCK_TAI & VGTOD_HRES_MASK) + /* * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time * so be carefull by modifying this structure.
> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx at linutronix.de> wrote: > > With the storage array in place it's now trivial to support CLOCK_TAI in > the vdso. Instead of extending the array to accomodate CLOCK_TAI, make use > of the fact that: > > - CLOCK ids are set in stone > - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so > the array slot 3 is unused > - CLOCK_TAI is id 11 which results in 3 when masked with 0x3 > > Add the mask to the basetime array lookup and set up the CLOCK_TAI base > time in update_vsyscall().That?s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere to assert that this actually works?> > The performance impact of the mask operation is within the noise. > > Signed-off-by: Thomas Gleixner <tglx at linutronix.de> > --- > arch/x86/entry/vdso/vclock_gettime.c | 2 +- > arch/x86/entry/vsyscall/vsyscall_gtod.c | 4 ++++ > arch/x86/include/asm/vgtod.h | 6 +++++- > 3 files changed, 10 insertions(+), 2 deletions(-) > > --- a/arch/x86/entry/vdso/vclock_gettime.c > +++ b/arch/x86/entry/vdso/vclock_gettime.c > @@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo > > notrace static int do_hres(clockid_t clk, struct timespec *ts) > { > - struct vgtod_ts *base = >od->basetime[clk]; > + struct vgtod_ts *base = >od->basetime[clk & VGTOD_HRES_MASK]; > u64 cycles, last, ns; > unsigned int seq; > > --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c > +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c > @@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper * > base->sec = tk->xtime_sec; > base->nsec = tk->tkr_mono.xtime_nsec; > > + base = &vdata->basetime[VGTOD_TAI]; > + base->sec = tk->xtime_sec + (s64)tk->tai_offset; > + base->nsec = tk->tkr_mono.xtime_nsec; > + > base = &vdata->basetime[CLOCK_MONOTONIC]; > base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec; > nsec = tk->tkr_mono.xtime_nsec; > --- a/arch/x86/include/asm/vgtod.h > +++ b/arch/x86/include/asm/vgtod.h > @@ -19,9 +19,13 @@ struct vgtod_ts { > }; > > #define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1) > -#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC)) > +#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI)) > #define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE)) > > +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */ > +#define VGTOD_HRES_MASK 0x3 > +#define VGTOD_TAI (CLOCK_TAI & VGTOD_HRES_MASK) > + > /* > * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time > * so be carefull by modifying this structure. > >
Arnd Bergmann
2018-Sep-14 14:22 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx at linutronix.de> wrote:> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. > > This completely eliminates the switch case and allows further > simplifications of the code base, which at the end all together gain a few > cycles performance or at least stay on par with todays code. The resulting > performance depends heavily on the micro architecture and the compiler.No objections from my side, just a few general remarks: Deepa and I have discussed the vdso in the past for 2038. I have started a patch that I'll have to redo on top of yours, but that is fine, your cleanup is only going to help here. More generally speaking, Deepa said that she would like to see some generalization on the vdso before adding the time64_t based variants. Your series probably makes x86 diverge more from the others, which makes it harder to consolidate them again, but we might just as well use your new implementation to base the generic one on, and just move the other ones over to that. A couple of architectures (s390, ia64, riscv, powerpc, arm64) implement the vdso as assembler code at the moment, so they won't be as easy to consolidate (other than outright replacing all the code). The other five: arch/x86/entry/vdso/vclock_gettime.c arch/sparc/vdso/vclock_gettime.c arch/nds32/kernel/vdso/gettimeofday.c arch/mips/vdso/gettimeofday.c arch/arm/vdso/vgettimeofday.c are basically all minor variations of the same code base and could be consolidated to some degree. Any suggestions here? Should we plan to do that consolitdation based on your new version, or just add clock_gettime64 in arm32 and x86-32, and then be done with it? The other ones will obviously still be fast for 32-bit time_t and will have a working non-vdso sys_clock_getttime64(). I also wonder about clock_getres(): half the architectures seem to implement it in vdso, but notably arm32 and x86 don't, and I had not expected it to be performance critical given that the result is easily cached in user space. Arnd
Vitaly Kuznetsov
2018-Sep-14 15:26 UTC
[patch 10/11] x86/vdso: Move cycle_last handling into the caller
Thomas Gleixner <tglx at linutronix.de> writes:> Dereferencing gtod->cycle_last all over the place and foing the cycles < > last comparison in the vclock read functions generates horrible code. Doing > it at the call site is much better and gains a few cycles both for TSC and > pvclock. > > Caveat: This adds the comparison to the hyperv vclock as well, but I have > no way to test that.While the change looks obviously correct for Hyper-V vclock too, by saying that you encouraged me to give the whole series a shot. I did and turns out Hyper-V vclock is broken: simple clock_gettime(CLOCK_REALTIME, ) test goes from 70 cpu cycles to 1000 (meaning that we're falling back to syscall I guess). Bisecting now, stay tuned! -- Vitaly
Vitaly Kuznetsov
2018-Sep-14 15:45 UTC
[patch 02/11] x86/time: Implement clocksource_arch_init()
Thomas Gleixner <tglx at linutronix.de> writes:> Runtime validate the VCLOCK_MODE in clocksource::archdata and disable > VCLOCK if invalid, which disables the VDSO but keeps the system running. > > Signed-off-by: Thomas Gleixner <tglx at linutronix.de> > --- > arch/x86/Kconfig | 1 + > arch/x86/kernel/time.c | 16 ++++++++++++++++ > 2 files changed, 17 insertions(+) > > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -48,6 +48,7 @@ config X86 > select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI > select ANON_INODES > select ARCH_CLOCKSOURCE_DATA > + select ARCH_CLOCKSOURCE_INIT > select ARCH_DISCARD_MEMBLOCK > select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI > select ARCH_HAS_DEBUG_VIRTUAL > --- a/arch/x86/kernel/time.c > +++ b/arch/x86/kernel/time.c > @@ -10,6 +10,7 @@ > * > */ > > +#include <linux/clocksource.h> > #include <linux/clockchips.h> > #include <linux/interrupt.h> > #include <linux/irq.h> > @@ -105,3 +106,18 @@ void __init time_init(void) > { > late_time_init = x86_late_time_init; > } > + > +/* > + * Sanity check the vdso related archdata content. > + */ > +void clocksource_arch_init(struct clocksource *cs) > +{ > + if (cs->archdata.vclock_mode == VCLOCK_NONE) > + return; > + > + if (cs->archdata.vclock_mode >= VCLOCK_MAX) {It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3> + pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n", > + cs->name, cs->archdata.vclock_mode); > + cs->archdata.vclock_mode = VCLOCK_NONE; > + } > +}-- Vitaly
Thomas Gleixner
2018-Sep-17 13:00 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, 14 Sep 2018, Arnd Bergmann wrote:> On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx at linutronix.de> wrote: > A couple of architectures (s390, ia64, riscv, powerpc, arm64) > implement the vdso as assembler code at the moment, so they > won't be as easy to consolidate (other than outright replacing all > the code). > > The other five: > arch/x86/entry/vdso/vclock_gettime.c > arch/sparc/vdso/vclock_gettime.c > arch/nds32/kernel/vdso/gettimeofday.c > arch/mips/vdso/gettimeofday.c > arch/arm/vdso/vgettimeofday.c > > are basically all minor variations of the same code base and could be > consolidated to some degree. > Any suggestions here? Should we plan to do that consolitdation based on > your new version, or just add clock_gettime64 in arm32 and x86-32, and then > be done with it? The other ones will obviously still be fast for 32-bit time_t > and will have a working non-vdso sys_clock_getttime64().In principle consolidating all those implementations should be possible to some extent and probably worthwhile. What's arch specific are the actual accessors to the hardware clocks.> I also wonder about clock_getres(): half the architectures seem to implement > it in vdso, but notably arm32 and x86 don't, and I had not expected it to be > performance critical given that the result is easily cached in user space.getres() is not really performance critical, but adding it does not create a huge problem either. Thanks, tglx
Andy Lutomirski
2018-Sep-17 19:25 UTC
[patch 09/11] x86/vdso: Simplify the invalid vclock case
On Fri, Sep 14, 2018 at 5:50 AM, Thomas Gleixner <tglx at linutronix.de> wrote:> The code flow for the vclocks is convoluted as it requires the vclocks > which can be invalidated separately from the vsyscall_gtod_data sequence to > store the fact in a separate variable. That's inefficient. >> notrace static int do_hres(clockid_t clk, struct timespec *ts) > { > struct vgtod_ts *base = >od->basetime[clk]; > unsigned int seq; > - int mode; > - u64 ns; > + u64 cycles, ns; > > do { > seq = gtod_read_begin(gtod); > - mode = gtod->vclock_mode; > ts->tv_sec = base->sec; > ns = base->nsec; > - ns += vgetsns(&mode); > + cycles = vgetcyc(gtod->vclock_mode); > + if (unlikely((s64)cycles < 0)) > + return vdso_fallback_gettime(clk, ts);i was contemplating this, and I would suggest one of two optimizations: 1. have all the helpers return a struct containing a u64 and a bool, and use that bool. The compiler should do okay. 2. Be sneaky. Later in the series, you do: if (unlikely((s64)cycles < 0)) return vdso_fallback_gettime(clk, ts); - ns += (cycles - gtod->cycle_last) * gtod->mult; + if (cycles > last) + ns += (cycles - last) * gtod->mult; How about: if (unlikely((s64)cycles <= last)) { if (cycles < 0) [or cycles == -1 or whatever] return vdso_fallback_gettime; } else { ns += (cycles - last) * gtod->mult; } which should actually make this whole mess be essentially free. Also, I'm not entirely convinced that this "last" thing is needed at all. John, what's the scenario under which we need it? --Andy --Andy
Thomas Gleixner
2018-Sep-18 07:52 UTC
[patch 09/11] x86/vdso: Simplify the invalid vclock case
On Mon, 17 Sep 2018, John Stultz wrote:> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto at kernel.org> wrote: > > Also, I'm not entirely convinced that this "last" thing is needed at > > all. John, what's the scenario under which we need it? > > So my memory is probably a bit foggy, but I recall that as we > accelerated gettimeofday, we found that even on systems that claimed > to have synced TSCs, they were actually just slightly out of sync. > Enough that right after cycles_last had been updated, a read on > another cpu could come in just behind cycles_last, resulting in a > negative interval causing lots of havoc. > > So the sanity check is needed to avoid that case.Your memory serves you right. That's indeed observable on CPUs which lack TSC_ADJUST. @Andy: Welcome to the wonderful world of TSC. Thanks, tglx
Peter Zijlstra
2018-Sep-18 08:30 UTC
[patch 09/11] x86/vdso: Simplify the invalid vclock case
On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:> On Mon, 17 Sep 2018, John Stultz wrote: > > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto at kernel.org> wrote: > > > Also, I'm not entirely convinced that this "last" thing is needed at > > > all. John, what's the scenario under which we need it? > > > > So my memory is probably a bit foggy, but I recall that as we > > accelerated gettimeofday, we found that even on systems that claimed > > to have synced TSCs, they were actually just slightly out of sync. > > Enough that right after cycles_last had been updated, a read on > > another cpu could come in just behind cycles_last, resulting in a > > negative interval causing lots of havoc. > > > > So the sanity check is needed to avoid that case. > > Your memory serves you right. That's indeed observable on CPUs which > lack TSC_ADJUST.But, if the gtod code can observe this, then why doesn't the code that checks the sync?
Andy Lutomirski
2018-Sep-18 14:01 UTC
[patch 09/11] x86/vdso: Simplify the invalid vclock case
> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx at linutronix.de> wrote: > >> On Mon, 17 Sep 2018, John Stultz wrote: >>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto at kernel.org> wrote: >>> Also, I'm not entirely convinced that this "last" thing is needed at >>> all. John, what's the scenario under which we need it? >> >> So my memory is probably a bit foggy, but I recall that as we >> accelerated gettimeofday, we found that even on systems that claimed >> to have synced TSCs, they were actually just slightly out of sync. >> Enough that right after cycles_last had been updated, a read on >> another cpu could come in just behind cycles_last, resulting in a >> negative interval causing lots of havoc. >> >> So the sanity check is needed to avoid that case. > > Your memory serves you right. That's indeed observable on CPUs which > lack TSC_ADJUST. > > @Andy: Welcome to the wonderful world of TSC. >Do we do better if we use signed arithmetic for the whole calculation? Then a small backwards movement would result in a small backwards result. Or we could offset everything so that we?d have to go back several hundred ms before we cross zero.
Andy Lutomirski
2018-Oct-03 05:15 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Hi Vitaly, Paolo, Radim, etc., On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx at linutronix.de> wrote:> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. >I was trying to understand more of the implications of this patch series, and I was again reminded that there is an entire extra copy of the vclock reading code in arch/x86/kvm/x86.c. And the purpose of that code is very, very opaque. Can one of you explain what the code is even doing? From a couple of attempts to read through it, it's a whole bunch of probably-extremely-buggy code that, drumroll please, tries to atomically read the TSC value and the time. And decide whether the result is "based on the TSC". And then synthesizes a TSC-to-ns multiplier and shift, based on *something other than the actual multiply and shift used*. IOW, unless I'm totally misunderstanding it, the code digs into the private arch clocksource data intended for the vDSO, uses a poorly maintained copy of the vDSO code to read the time (instead of doing the sane thing and using the kernel interfaces for this), and propagates a totally made up copy to the guest. And gets it entirely wrong when doing nested virt, since, unless there's some secret in this maze, it doesn't acutlaly use the scaling factor from the host when it tells the guest what to do. I am really, seriously tempted to send a patch to simply delete all this code. The correct way to do it is to hook And I don't see how it's even possible to pass kvmclock correctly to the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but L1 isn't notified when the data structure changes, so how the heck is it supposed to update the kvmclock structure? --Andy
Vitaly Kuznetsov
2018-Oct-03 09:22 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Andy Lutomirski <luto at kernel.org> writes:> Hi Vitaly, Paolo, Radim, etc., > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx at linutronix.de> wrote: >> >> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() >> implementation, which extended the clockid switch case and added yet >> another slightly different copy of the same code. >> >> Especially the extended switch case is problematic as the compiler tends to >> generate a jump table which then requires to use retpolines. If jump tables >> are disabled it adds yet another conditional to the existing maze. >> >> This series takes a different approach by consolidating the almost >> identical functions into one implementation for high resolution clocks and >> one for the coarse grained clock ids by storing the base data for each >> clock id in an array which is indexed by the clock id. >> > > I was trying to understand more of the implications of this patch > series, and I was again reminded that there is an entire extra copy of > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > that code is very, very opaque. > > Can one of you explain what the code is even doing? From a couple of > attempts to read through it, it's a whole bunch of > probably-extremely-buggy code that, drumroll please, tries to > atomically read the TSC value and the time. And decide whether the > result is "based on the TSC". And then synthesizes a TSC-to-ns > multiplier and shift, based on *something other than the actual > multiply and shift used*. > > IOW, unless I'm totally misunderstanding it, the code digs into the > private arch clocksource data intended for the vDSO, uses a poorly > maintained copy of the vDSO code to read the time (instead of doing > the sane thing and using the kernel interfaces for this), and > propagates a totally made up copy to the guest. And gets it entirely > wrong when doing nested virt, since, unless there's some secret in > this maze, it doesn't acutlaly use the scaling factor from the host > when it tells the guest what to do. > > I am really, seriously tempted to send a patch to simply delete all > this code. The correct way to do it is to hook"I have discovered a truly marvelous proof of this, which this margin is too narrow to contain" :-) There is a very long history of different (hardware) issues Marcelo was fighting with and the current code is the survived Frankenstein. E.g. it is very, very unclear what "catchup", "always catchup" and masterclock-less mode in general are and if we still need them. That said I'm all for simplification. I'm not sure if we still need to care about buggy hardware though.> > And I don't see how it's even possible to pass kvmclock correctly to > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > L1 isn't notified when the data structure changes, so how the heck is > it supposed to update the kvmclock structure?Well, this kind of works in the the followin way: L1's clocksource is 'tsc_page' which is, basically, a compliment to TSC: two numbers provided by L0: offset and scale and KVM was tought to treat this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable clocksource to guests when running nested on Hyper-V"). The notification you're talking about exists, it is called Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V reenlightenment"). When TSC page changes (and this only happens when L1 is migrated to a different host with a different TSC frequency and TSC scaling is not supported by the CPU) we receive an interrupt in L1 (at this moment all TSC accesses are emulated which guarantees the correctness of the readings), pause all L2 guests, update their kvmclock structures with new data (we already know the new TSC frequency) and then tell L0 that we're done and it can stop emulating TSC accesses. (Nothing like this exists for KVM-on-KVM, by the way, when L1's clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.) -- Vitaly
Marcelo Tosatti
2018-Oct-03 19:00 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:> Hi Vitaly, Paolo, Radim, etc., > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx at linutronix.de> wrote: > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > implementation, which extended the clockid switch case and added yet > > another slightly different copy of the same code. > > > > Especially the extended switch case is problematic as the compiler tends to > > generate a jump table which then requires to use retpolines. If jump tables > > are disabled it adds yet another conditional to the existing maze. > > > > This series takes a different approach by consolidating the almost > > identical functions into one implementation for high resolution clocks and > > one for the coarse grained clock ids by storing the base data for each > > clock id in an array which is indexed by the clock id. > > > > I was trying to understand more of the implications of this patch > series, and I was again reminded that there is an entire extra copy of > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > that code is very, very opaque. > > Can one of you explain what the code is even doing? From a couple of > attempts to read through it, it's a whole bunch of > probably-extremely-buggy code that,Yes, probably.> drumroll please, tries to atomically read the TSC value and the time. And decide whether the > result is "based on the TSC".I think "based on the TSC" refers to whether TSC clocksource is being used.> And then synthesizes a TSC-to-ns > multiplier and shift, based on *something other than the actual > multiply and shift used*. > > IOW, unless I'm totally misunderstanding it, the code digs into the > private arch clocksource data intended for the vDSO, uses a poorly > maintained copy of the vDSO code to read the time (instead of doing > the sane thing and using the kernel interfaces for this), and > propagates a totally made up copy to the guest.I posted kernel interfaces for this, and it was suggested to instead write a "in-kernel user of pvclock data". If you can get kernel interfaces to replace that, go for it. I prefer kernel interfaces as well.> And gets it entirely > wrong when doing nested virt, since, unless there's some secret in > this maze, it doesn't acutlaly use the scaling factor from the host > when it tells the guest what to do. > > I am really, seriously tempted to send a patch to simply delete all > this code.If your patch which deletes the code gets the necessary features right, sure, go for it.> The correct way to do it is to hookCan you expand on the correct way to do it?> And I don't see how it's even possible to pass kvmclock correctly to > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > L1 isn't notified when the data structure changes, so how the heck is > it supposed to update the kvmclock structure?I don't parse your question.> > --Andy
Andy Lutomirski
2018-Oct-04 20:32 UTC
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx at linutronix.de> wrote:> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. > > This completely eliminates the switch case and allows further > simplifications of the code base, which at the end all together gain a few > cycles performance or at least stay on par with todays code. The resulting > performance depends heavily on the micro architecture and the compiler.tglx, please consider this whole series Acked-by: Andy Lutomirski <luto at kernel.org> Please feel free to push it top tip:x86/vdso, as long as you pull in tip/x86/urgent first. Once it lands, I'll email out a couple of follow-up patches. --Andy