This is v4 of the HPET series. It now works correctly for the legacy logic used by Nehalem and Westmere systems. Patch 1 is some pre cleanup which was brought forward from the main patch. Patch 2 is the main chunk of work. Patch 3 is some post cleanup which could be deferred. Patches 4 and 5 are just debugging code. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> CC: Tim Deegan <tim@xen.org> -- 1.7.10.4
These changes are ones which are able to be pulled out of the subsequent patch, to make it clearer to understand and review. They are all misc fixes with negligible functional changes. * Rename hpet_next_event -> hpet_set_counter and convert it to take an hpet_event_channel pointer rather than a timer index. * Rename reprogram_hpet_evt_channel -> hpet_program_time * Move the position of setting up HPET_EVT_LEGACY in hpet_broadcast_init() It didn''t need to be there. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> CC: Tim Deegan <tim@xen.org> --- xen/arch/x86/hpet.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index 3a4f7e8..fd44582 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -94,7 +94,12 @@ static inline unsigned long ns2ticks(unsigned long nsec, int shift, return (unsigned long) tmp; } -static int hpet_next_event(unsigned long delta, int timer) +/* + * Program an HPET channels counter relative to now. ''delta'' is specified in + * ticks, and should be calculated with ns2ticks(). The channel lock should + * be taken and interrupts must be disabled. + */ +static int hpet_set_counter(struct hpet_event_channel *ch, unsigned long delta) { uint32_t cnt, cmp; unsigned long flags; @@ -102,7 +107,7 @@ static int hpet_next_event(unsigned long delta, int timer) local_irq_save(flags); cnt = hpet_read32(HPET_COUNTER); cmp = cnt + delta; - hpet_write32(cmp, HPET_Tn_CMP(timer)); + hpet_write32(cmp, HPET_Tn_CMP(ch->idx)); cmp = hpet_read32(HPET_COUNTER); local_irq_restore(flags); @@ -110,9 +115,12 @@ static int hpet_next_event(unsigned long delta, int timer) return ((cmp + 2 - cnt) > delta) ? -ETIME : 0; } -static int reprogram_hpet_evt_channel( - struct hpet_event_channel *ch, - s_time_t expire, s_time_t now, int force) +/* + * Set the time at which an HPET channel should fire. The channel lock should + * be held. + */ +static int hpet_program_time(struct hpet_event_channel *ch, + s_time_t expire, s_time_t now, int force) { int64_t delta; int ret; @@ -143,11 +151,11 @@ static int reprogram_hpet_evt_channel( delta = max_t(int64_t, delta, MIN_DELTA_NS); delta = ns2ticks(delta, ch->shift, ch->mult); - ret = hpet_next_event(delta, ch->idx); + ret = hpet_set_counter(ch, delta); while ( ret && force ) { delta += delta; - ret = hpet_next_event(delta, ch->idx); + ret = hpet_set_counter(ch, delta); } return ret; @@ -209,7 +217,7 @@ again: spin_lock_irqsave(&ch->lock, flags); if ( next_event < ch->next_event && - reprogram_hpet_evt_channel(ch, next_event, now, 0) ) + hpet_program_time(ch, next_event, now, 0) ) goto again; spin_unlock_irqrestore(&ch->lock, flags); @@ -583,6 +591,8 @@ void __init hpet_broadcast_init(void) cfg |= HPET_CFG_LEGACY; n = 1; + hpet_events->flags = HPET_EVT_LEGACY; + if ( !force_hpet_broadcast ) pv_rtc_handler = handle_rtc_once; } @@ -615,9 +625,6 @@ void __init hpet_broadcast_init(void) hpet_events[i].msi.msi_attrib.maskbit = 1; hpet_events[i].msi.msi_attrib.pos = MSI_TYPE_HPET; } - - if ( !num_hpets_used ) - hpet_events->flags = HPET_EVT_LEGACY; } void hpet_broadcast_resume(void) @@ -716,7 +723,7 @@ void hpet_broadcast_enter(void) spin_lock(&ch->lock); /* reprogram if current cpu expire time is nearer */ if ( per_cpu(timer_deadline, cpu) < ch->next_event ) - reprogram_hpet_evt_channel(ch, per_cpu(timer_deadline, cpu), NOW(), 1); + hpet_program_time(ch, per_cpu(timer_deadline, cpu), NOW(), 1); spin_unlock(&ch->lock); } -- 1.7.10.4
Andrew Cooper
2013-Nov-13 17:59 UTC
[Patch v4 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
This involves rewriting most of the MSI related HPET code, and as a result this patch looks very complicated. It is probably best viewed as an end result, with the following notes explaining what is going on. The new logic is as follows: * A single high priority vector is allocated and uses on all cpus. * Reliance on the irq infrastructure is completely removed. * Tracking of free hpet channels has changed. It is now an individual bitmap, and allocation is based on winning a test_and_clear_bit() operation. * There is a notion of strict ownership of hpet channels. ** A cpu which owns an HPET channel can program it for a desired deadline. ** A cpu which can''t find a free HPET channel will have to share. ** If an HPET firing at an appropriate time can be found, a CPU will simply request to be woken up with that HPET. ** If all HPETs are firing too far into the future, a CPU may request to be woken up and reprogram that channel earlier. * Some functions have been renamed to be more descriptive. Some functions have parameters changed to be more consistent. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> CC: Tim Deegan <tim@xen.org> --- xen/arch/x86/hpet.c | 537 +++++++++++++++++++++++---------------------------- 1 file changed, 237 insertions(+), 300 deletions(-) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index fd44582..20339b4 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -4,26 +4,21 @@ * HPET management. */ -#include <xen/config.h> +#include <xen/lib.h> +#include <xen/init.h> +#include <xen/cpuidle.h> #include <xen/errno.h> -#include <xen/time.h> -#include <xen/timer.h> -#include <xen/smp.h> #include <xen/softirq.h> -#include <xen/irq.h> -#include <xen/numa.h> + +#include <mach_apic.h> + #include <asm/fixmap.h> #include <asm/div64.h> #include <asm/hpet.h> -#include <asm/msi.h> -#include <mach_apic.h> -#include <xen/cpuidle.h> #define MAX_DELTA_NS MILLISECS(10*1000) #define MIN_DELTA_NS MICROSECS(20) -#define HPET_EVT_USED_BIT 0 -#define HPET_EVT_USED (1 << HPET_EVT_USED_BIT) #define HPET_EVT_DISABLE_BIT 1 #define HPET_EVT_DISABLE (1 << HPET_EVT_DISABLE_BIT) #define HPET_EVT_LEGACY_BIT 2 @@ -36,8 +31,6 @@ struct hpet_event_channel s_time_t next_event; cpumask_var_t cpumask; spinlock_t lock; - void (*event_handler)(struct hpet_event_channel *); - unsigned int idx; /* physical channel idx */ unsigned int cpu; /* msi target */ struct msi_desc msi;/* msi state */ @@ -48,8 +41,20 @@ static struct hpet_event_channel *__read_mostly hpet_events; /* msi hpet channels used for broadcast */ static unsigned int __read_mostly num_hpets_used; -DEFINE_PER_CPU(struct hpet_event_channel *, cpu_bc_channel); +/* High-priority vector for HPET interrupts */ +static u8 __read_mostly hpet_vector; +/* + * HPET channel used for idling. Either the HPET channel this cpu owns + * (indicated by channel->cpu pointing back), or the HPET channel belonging to + * another cpu with which we have requested to be woken. + */ +static DEFINE_PER_CPU(struct hpet_event_channel *, hpet_channel); + +/* Bitmask of currently-free HPET channels. */ +static uint32_t free_channels; + +/* Data from the HPET ACPI table */ unsigned long __initdata hpet_address; u8 __initdata hpet_blockid; @@ -161,89 +166,43 @@ static int hpet_program_time(struct hpet_event_channel *ch, return ret; } -static void evt_do_broadcast(cpumask_t *mask) +/* Wake up all cpus in the channel mask. Lock should be held. */ +static void hpet_wake_cpus(struct hpet_event_channel *ch) { - unsigned int cpu = smp_processor_id(); + cpuidle_wakeup_mwait(ch->cpumask); - if ( cpumask_test_and_clear_cpu(cpu, mask) ) - raise_softirq(TIMER_SOFTIRQ); - - cpuidle_wakeup_mwait(mask); - - if ( !cpumask_empty(mask) ) - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); + if ( !cpumask_empty(ch->cpumask) ) + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); } -static void handle_hpet_broadcast(struct hpet_event_channel *ch) +/* HPET interrupt handler. Wake all requested cpus. Lock should be held. */ +static void hpet_interrupt_handler(struct hpet_event_channel *ch) { - cpumask_t mask; - s_time_t now, next_event; - unsigned int cpu; - unsigned long flags; - - spin_lock_irqsave(&ch->lock, flags); - -again: - ch->next_event = STIME_MAX; - - spin_unlock_irqrestore(&ch->lock, flags); - - next_event = STIME_MAX; - cpumask_clear(&mask); - now = NOW(); - - /* find all expired events */ - for_each_cpu(cpu, ch->cpumask) - { - s_time_t deadline; - - rmb(); - deadline = per_cpu(timer_deadline, cpu); - rmb(); - if ( !cpumask_test_cpu(cpu, ch->cpumask) ) - continue; - - if ( deadline <= now ) - cpumask_set_cpu(cpu, &mask); - else if ( deadline < next_event ) - next_event = deadline; - } - - /* wakeup the cpus which have an expired event. */ - evt_do_broadcast(&mask); - - if ( next_event != STIME_MAX ) - { - spin_lock_irqsave(&ch->lock, flags); - - if ( next_event < ch->next_event && - hpet_program_time(ch, next_event, now, 0) ) - goto again; - - spin_unlock_irqrestore(&ch->lock, flags); - } + hpet_wake_cpus(ch); + raise_softirq(TIMER_SOFTIRQ); } -static void hpet_interrupt_handler(int irq, void *data, - struct cpu_user_regs *regs) +/* HPET interrupt entry. This is set up as a high priority vector. */ +static void do_hpet_irq(struct cpu_user_regs *regs) { - struct hpet_event_channel *ch = (struct hpet_event_channel *)data; - - this_cpu(irq_count)--; + unsigned int cpu = smp_processor_id(); + struct hpet_event_channel *ch = this_cpu(hpet_channel); - if ( !ch->event_handler ) + if ( ch ) { - printk(XENLOG_WARNING "Spurious HPET timer interrupt on HPET timer %d\n", ch->idx); - return; + spin_lock(&ch->lock); + if ( ch->cpu == cpu ) + hpet_interrupt_handler(ch); + spin_unlock(&ch->lock); } - ch->event_handler(ch); + ack_APIC_irq(); } -static void hpet_msi_unmask(struct irq_desc *desc) +/* Unmask an HPET MSI channel. Lock should be held */ +static void hpet_msi_unmask(struct hpet_event_channel *ch) { u32 cfg; - struct hpet_event_channel *ch = desc->action->dev_id; cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); cfg |= HPET_TN_ENABLE; @@ -251,10 +210,10 @@ static void hpet_msi_unmask(struct irq_desc *desc) ch->msi.msi_attrib.masked = 0; } -static void hpet_msi_mask(struct irq_desc *desc) +/* Mask an HPET MSI channel. Lock should be held */ +static void hpet_msi_mask(struct hpet_event_channel *ch) { u32 cfg; - struct hpet_event_channel *ch = desc->action->dev_id; cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); cfg &= ~HPET_TN_ENABLE; @@ -262,92 +221,36 @@ static void hpet_msi_mask(struct irq_desc *desc) ch->msi.msi_attrib.masked = 1; } -static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg) +/* + * Set up the MSI for an HPET channel to point at the allocated cpu, including + * interrupt remapping entries when appropriate. The channel lock is expected + * to be held, and the MSI must currently be masked. + */ +static int hpet_setup_msi(struct hpet_event_channel *ch) { - ch->msi.msg = *msg; + ASSERT(ch->cpu != -1); + ASSERT(ch->msi.msi_attrib.masked == 1); + + msi_compose_msg(hpet_vector, cpumask_of(ch->cpu), &ch->msi.msg); if ( iommu_intremap ) { - int rc = iommu_update_ire_from_msi(&ch->msi, msg); + int rc = iommu_update_ire_from_msi(&ch->msi, &ch->msi.msg); if ( rc ) return rc; } - hpet_write32(msg->data, HPET_Tn_ROUTE(ch->idx)); - hpet_write32(msg->address_lo, HPET_Tn_ROUTE(ch->idx) + 4); + hpet_write32(ch->msi.msg.data, HPET_Tn_ROUTE(ch->idx)); + hpet_write32(ch->msi.msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4); return 0; } -static void __maybe_unused -hpet_msi_read(struct hpet_event_channel *ch, struct msi_msg *msg) -{ - msg->data = hpet_read32(HPET_Tn_ROUTE(ch->idx)); - msg->address_lo = hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4); - msg->address_hi = MSI_ADDR_BASE_HI; - if ( iommu_intremap ) - iommu_read_msi_from_ire(&ch->msi, msg); -} - -static unsigned int hpet_msi_startup(struct irq_desc *desc) -{ - hpet_msi_unmask(desc); - return 0; -} - -#define hpet_msi_shutdown hpet_msi_mask - -static void hpet_msi_ack(struct irq_desc *desc) -{ - irq_complete_move(desc); - move_native_irq(desc); - ack_APIC_irq(); -} - -static void hpet_msi_set_affinity(struct irq_desc *desc, const cpumask_t *mask) -{ - struct hpet_event_channel *ch = desc->action->dev_id; - struct msi_msg msg = ch->msi.msg; - - msg.dest32 = set_desc_affinity(desc, mask); - if ( msg.dest32 == BAD_APICID ) - return; - - msg.data &= ~MSI_DATA_VECTOR_MASK; - msg.data |= MSI_DATA_VECTOR(desc->arch.vector); - msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; - msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32); - if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 ) - hpet_msi_write(ch, &msg); -} - -/* - * IRQ Chip for MSI HPET Devices, - */ -static hw_irq_controller hpet_msi_type = { - .typename = "HPET-MSI", - .startup = hpet_msi_startup, - .shutdown = hpet_msi_shutdown, - .enable = hpet_msi_unmask, - .disable = hpet_msi_mask, - .ack = hpet_msi_ack, - .set_affinity = hpet_msi_set_affinity, -}; - -static int __hpet_setup_msi_irq(struct irq_desc *desc) -{ - struct msi_msg msg; - - msi_compose_msg(desc->arch.vector, desc->arch.cpu_mask, &msg); - return hpet_msi_write(desc->action->dev_id, &msg); -} - -static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) +static int __init hpet_init_msi(struct hpet_event_channel *ch) { int ret; u32 cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); - irq_desc_t *desc = irq_to_desc(ch->msi.irq); if ( iommu_intremap ) { @@ -358,14 +261,12 @@ static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) } /* set HPET Tn as oneshot */ - cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC); + cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC | HPET_TN_ENABLE); cfg |= HPET_TN_FSB | HPET_TN_32BIT; hpet_write32(cfg, HPET_Tn_CFG(ch->idx)); + ch->msi.msi_attrib.masked = 1; - desc->handler = &hpet_msi_type; - ret = request_irq(ch->msi.irq, hpet_interrupt_handler, "HPET", ch); - if ( ret >= 0 ) - ret = __hpet_setup_msi_irq(desc); + ret = hpet_setup_msi(ch); if ( ret < 0 ) { if ( iommu_intremap ) @@ -373,25 +274,6 @@ static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) return ret; } - desc->msi_desc = &ch->msi; - - return 0; -} - -static int __init hpet_assign_irq(struct hpet_event_channel *ch) -{ - int irq; - - if ( (irq = create_irq(NUMA_NO_NODE)) < 0 ) - return irq; - - ch->msi.irq = irq; - if ( hpet_setup_msi_irq(ch) ) - { - destroy_irq(irq); - return -EINVAL; - } - return 0; } @@ -412,6 +294,8 @@ static void __init hpet_fsb_cap_lookup(void) if ( !hpet_events ) return; + alloc_direct_apic_vector(&hpet_vector, do_hpet_irq); + for ( i = 0; i < num_chs && num_hpets_used < nr_cpu_ids; i++ ) { struct hpet_event_channel *ch = &hpet_events[num_hpets_used]; @@ -434,7 +318,7 @@ static void __init hpet_fsb_cap_lookup(void) ch->flags = 0; ch->idx = i; - if ( hpet_assign_irq(ch) == 0 ) + if ( hpet_init_msi(ch) == 0 ) num_hpets_used++; } @@ -442,102 +326,28 @@ static void __init hpet_fsb_cap_lookup(void) num_hpets_used, num_chs); } -static struct hpet_event_channel *hpet_get_channel(unsigned int cpu) +/* + * Search for, and allocate, a free HPET channel. Returns a pointer to the + * channel, or NULL in the case that none were free. The caller is + * responsible for returning the channel to the free pool. + */ +static struct hpet_event_channel *hpet_get_free_channel(void) { - static unsigned int next_channel; - unsigned int i, next; - struct hpet_event_channel *ch; - - if ( num_hpets_used == 0 ) - return hpet_events; + unsigned ch, tries; - if ( num_hpets_used >= nr_cpu_ids ) - return &hpet_events[cpu]; - - do { - next = next_channel; - if ( (i = next + 1) == num_hpets_used ) - i = 0; - } while ( cmpxchg(&next_channel, next, i) != next ); - - /* try unused channel first */ - for ( i = next; i < next + num_hpets_used; i++ ) + for ( tries = num_hpets_used; tries; --tries ) { - ch = &hpet_events[i % num_hpets_used]; - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - { - ch->cpu = cpu; - return ch; - } - } - - /* share a in-use channel */ - ch = &hpet_events[next]; - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - ch->cpu = cpu; - - return ch; -} - -static void set_channel_irq_affinity(struct hpet_event_channel *ch) -{ - struct irq_desc *desc = irq_to_desc(ch->msi.irq); - - ASSERT(!local_irq_is_enabled()); - spin_lock(&desc->lock); - hpet_msi_mask(desc); - hpet_msi_set_affinity(desc, cpumask_of(ch->cpu)); - hpet_msi_unmask(desc); - spin_unlock(&desc->lock); - - spin_unlock(&ch->lock); - - /* We may have missed an interrupt due to the temporary masking. */ - if ( ch->event_handler && ch->next_event < NOW() ) - ch->event_handler(ch); -} - -static void hpet_attach_channel(unsigned int cpu, - struct hpet_event_channel *ch) -{ - ASSERT(!local_irq_is_enabled()); - spin_lock(&ch->lock); - - per_cpu(cpu_bc_channel, cpu) = ch; - - /* try to be the channel owner again while holding the lock */ - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - ch->cpu = cpu; - - if ( ch->cpu != cpu ) - spin_unlock(&ch->lock); - else - set_channel_irq_affinity(ch); -} - -static void hpet_detach_channel(unsigned int cpu, - struct hpet_event_channel *ch) -{ - spin_lock_irq(&ch->lock); - - ASSERT(ch == per_cpu(cpu_bc_channel, cpu)); + if ( (ch = ffs(free_channels)) == 0 ) + break; - per_cpu(cpu_bc_channel, cpu) = NULL; + --ch; + ASSERT(ch < num_hpets_used); - if ( cpu != ch->cpu ) - spin_unlock_irq(&ch->lock); - else if ( cpumask_empty(ch->cpumask) ) - { - ch->cpu = -1; - clear_bit(HPET_EVT_USED_BIT, &ch->flags); - spin_unlock_irq(&ch->lock); - } - else - { - ch->cpu = cpumask_first(ch->cpumask); - set_channel_irq_affinity(ch); - local_irq_enable(); + if ( test_and_clear_bit(ch, &free_channels) ) + return &hpet_events[ch]; } + + return NULL; } #include <asm/mc146818rtc.h> @@ -574,6 +384,7 @@ void __init hpet_broadcast_init(void) /* Stop HPET legacy interrupts */ cfg &= ~HPET_CFG_LEGACY; n = num_hpets_used; + free_channels = (1U << n) - 1; } else { @@ -585,7 +396,6 @@ void __init hpet_broadcast_init(void) hpet_events = xzalloc(struct hpet_event_channel); if ( !hpet_events || !zalloc_cpumask_var(&hpet_events->cpumask) ) return; - hpet_events->msi.irq = -1; /* Start HPET legacy interrupts */ cfg |= HPET_CFG_LEGACY; @@ -619,9 +429,8 @@ void __init hpet_broadcast_init(void) hpet_events[i].shift = 32; hpet_events[i].next_event = STIME_MAX; spin_lock_init(&hpet_events[i].lock); - wmb(); - hpet_events[i].event_handler = handle_hpet_broadcast; + hpet_events[i].msi.irq = -1; hpet_events[i].msi.msi_attrib.maskbit = 1; hpet_events[i].msi.msi_attrib.pos = MSI_TYPE_HPET; } @@ -658,15 +467,24 @@ void hpet_broadcast_resume(void) for ( i = 0; i < n; i++ ) { - if ( hpet_events[i].msi.irq >= 0 ) - __hpet_setup_msi_irq(irq_to_desc(hpet_events[i].msi.irq)); - /* set HPET Tn as oneshot */ cfg = hpet_read32(HPET_Tn_CFG(hpet_events[i].idx)); cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC); - cfg |= HPET_TN_ENABLE | HPET_TN_32BIT; - if ( !(hpet_events[i].flags & HPET_EVT_LEGACY) ) + cfg |= HPET_TN_32BIT; + + /* + * Legacy HPET channel enabled here. MSI channels enabled in + * hpet_broadcast_init() when claimed by a cpu. + */ + if ( hpet_events[i].flags & HPET_EVT_LEGACY ) + cfg |= HPET_TN_ENABLE; + else + { + cfg &= ~HPET_TN_ENABLE; cfg |= HPET_TN_FSB; + hpet_events[i].msi.msi_attrib.masked = 1; + } + hpet_write32(cfg, HPET_Tn_CFG(hpet_events[i].idx)); hpet_events[i].next_event = STIME_MAX; @@ -703,50 +521,162 @@ void hpet_disable_legacy_broadcast(void) void hpet_broadcast_enter(void) { unsigned int cpu = smp_processor_id(); - struct hpet_event_channel *ch = per_cpu(cpu_bc_channel, cpu); + struct hpet_event_channel *ch = this_cpu(hpet_channel); + s_time_t deadline = this_cpu(timer_deadline); + + ASSERT(!local_irq_is_enabled()); + ASSERT(ch == NULL); - if ( per_cpu(timer_deadline, cpu) == 0 ) + if ( deadline == 0 ) return; - if ( !ch ) - ch = hpet_get_channel(cpu); + /* If using HPET in legacy timer mode */ + if ( num_hpets_used == 0 ) + { + spin_lock(&hpet_events->lock); - ASSERT(!local_irq_is_enabled()); + cpumask_set_cpu(cpu, hpet_events->cpumask); + if ( deadline < hpet_events->next_event ) + hpet_program_time(hpet_events, deadline, NOW(), 1); + + spin_unlock(&hpet_events->lock); + return; + } + + ch = hpet_get_free_channel(); + + if ( ch ) + { + spin_lock(&ch->lock); + + /* This really should be an MSI channel by this point */ + ASSERT(!(ch->flags & HPET_EVT_LEGACY)); + + hpet_msi_mask(ch); + + this_cpu(hpet_channel) = ch; + ch->cpu = cpu; + cpumask_set_cpu(cpu, ch->cpumask); + + hpet_setup_msi(ch); + hpet_program_time(ch, deadline, NOW(), 1); + hpet_msi_unmask(ch); + + spin_unlock(&ch->lock); + } + else + { + /* TODO - this seems very ugly */ + s_time_t fallback_deadline = STIME_MAX; + unsigned int i, fallback_idx = -1; + + for ( i = 0; i < num_hpets_used; ++i ) + { + ch = &hpet_events[i]; + spin_lock(&ch->lock); + + if ( ch->cpu == -1 ) + goto continue_search; + + /* This channel is going to expire far too early */ + if ( ch->next_event < deadline - MICROSECS(50) ) + goto continue_search; + + /* We can deal with being woken with 50us to spare */ + if ( ch->next_event <= deadline ) + break; + + /* Otherwise record our best HPET to borrow. */ + if ( ch->next_event <= fallback_deadline ) + { + fallback_idx = i; + fallback_deadline = ch->next_event; + } + + continue_search: + spin_unlock(&ch->lock); + ch = NULL; + } + + if ( ch ) + { + /* Found HPET with an appropriate time. Request to be woken up */ + cpumask_set_cpu(cpu, ch->cpumask); + this_cpu(hpet_channel) = ch; + spin_unlock(&ch->lock); + } + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) + { + /* + * Else we want to reprogram the fallback HPET sooner if possible, + * but with all the spinlocking, it just might have passed. + */ + ch = &hpet_events[fallback_idx]; + + spin_lock(&ch->lock); - if ( !(ch->flags & HPET_EVT_LEGACY) ) - hpet_attach_channel(cpu, ch); + if ( ch->cpu != 1 && ch->next_event == fallback_deadline ) + { + cpumask_set_cpu(cpu, ch->cpumask); + hpet_program_time(ch, deadline, NOW(), 1); + } + else + /* All else has failed. Wander the idle loop again */ + this_cpu(timer_deadline) = NOW() - 1; + + spin_unlock(&ch->lock); + } + else + /* All HPETs were too soon. Wander the idle loop again */ + this_cpu(timer_deadline) = NOW() - 1; + } /* Disable LAPIC timer interrupts. */ disable_APIC_timer(); - cpumask_set_cpu(cpu, ch->cpumask); - - spin_lock(&ch->lock); - /* reprogram if current cpu expire time is nearer */ - if ( per_cpu(timer_deadline, cpu) < ch->next_event ) - hpet_program_time(ch, per_cpu(timer_deadline, cpu), NOW(), 1); - spin_unlock(&ch->lock); } void hpet_broadcast_exit(void) { unsigned int cpu = smp_processor_id(); - struct hpet_event_channel *ch = per_cpu(cpu_bc_channel, cpu); + struct hpet_event_channel *ch = this_cpu(hpet_channel); - if ( per_cpu(timer_deadline, cpu) == 0 ) + ASSERT(local_irq_is_enabled()); + + if ( this_cpu(timer_deadline) == 0 ) + return; + + /* If using HPET in legacy timer mode */ + if ( num_hpets_used == 0 ) + { + /* This is safe without the spinlock, and will reduce contention. */ + cpumask_clear_cpu(cpu, hpet_events->cpumask); return; + } if ( !ch ) - ch = hpet_get_channel(cpu); + return; - /* Reprogram the deadline; trigger timer work now if it has passed. */ - enable_APIC_timer(); - if ( !reprogram_timer(per_cpu(timer_deadline, cpu)) ) - raise_softirq(TIMER_SOFTIRQ); + spin_lock_irq(&ch->lock); cpumask_clear_cpu(cpu, ch->cpumask); - if ( !(ch->flags & HPET_EVT_LEGACY) ) - hpet_detach_channel(cpu, ch); + /* If we own the channel, detach it */ + if ( ch->cpu == cpu ) + { + hpet_msi_mask(ch); + hpet_wake_cpus(ch); + ch->cpu = -1; + set_bit(ch->idx, &free_channels); + } + + this_cpu(hpet_channel) = NULL; + + spin_unlock_irq(&ch->lock); + + /* Reprogram the deadline; trigger timer work now if it has passed. */ + enable_APIC_timer(); + if ( !reprogram_timer(this_cpu(timer_deadline)) ) + raise_softirq(TIMER_SOFTIRQ); } int hpet_broadcast_is_available(void) @@ -763,7 +693,14 @@ int hpet_legacy_irq_tick(void) (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) ! HPET_EVT_LEGACY ) return 0; - hpet_events->event_handler(hpet_events); + + spin_lock_irq(&hpet_events->lock); + + hpet_interrupt_handler(hpet_events); + hpet_events->next_event = STIME_MAX; + + spin_unlock_irq(&hpet_events->lock); + return 1; } -- 1.7.10.4
These changes are ones which were able to be pulled out of the previous patch. They are all misc cleanup without functional implications * Shift HPET_EVT_* definitions up now that USED has moved out * Shuffle struct hpet_event_channel ** Reflow horizontally and comment current use ** Promote ''shift'' to unsigned. It is the constant 32 but can be more easily optimised. ** Move ''flags'' up to fill 4 byte hole ** Move ''cpumask'' and ''lock'' into second cache line as they are diried from other cpus * The new locking requirements guarentee that interrupts are disabled in hpet_set_counter. Leave an ASSERT() just in case. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> CC: Tim Deegan <tim@xen.org> --- xen/arch/x86/hpet.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index 20339b4..0bb3c31 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -19,22 +19,22 @@ #define MAX_DELTA_NS MILLISECS(10*1000) #define MIN_DELTA_NS MICROSECS(20) -#define HPET_EVT_DISABLE_BIT 1 +#define HPET_EVT_DISABLE_BIT 0 #define HPET_EVT_DISABLE (1 << HPET_EVT_DISABLE_BIT) -#define HPET_EVT_LEGACY_BIT 2 +#define HPET_EVT_LEGACY_BIT 1 #define HPET_EVT_LEGACY (1 << HPET_EVT_LEGACY_BIT) struct hpet_event_channel { - unsigned long mult; - int shift; - s_time_t next_event; - cpumask_var_t cpumask; - spinlock_t lock; - unsigned int idx; /* physical channel idx */ - unsigned int cpu; /* msi target */ - struct msi_desc msi;/* msi state */ - unsigned int flags; /* HPET_EVT_x */ + unsigned long mult; /* tick <-> time conversion */ + unsigned int shift; /* tick <-> time conversion */ + unsigned int flags; /* HPET_EVT_x */ + s_time_t next_event; /* expected time of next interrupt */ + unsigned int idx; /* HPET counter index */ + unsigned int cpu; /* owner of channel (or -1) */ + struct msi_desc msi; /* msi state */ + cpumask_var_t cpumask; /* cpus wishing to be woken */ + spinlock_t lock; } __cacheline_aligned; static struct hpet_event_channel *__read_mostly hpet_events; @@ -107,14 +107,13 @@ static inline unsigned long ns2ticks(unsigned long nsec, int shift, static int hpet_set_counter(struct hpet_event_channel *ch, unsigned long delta) { uint32_t cnt, cmp; - unsigned long flags; - local_irq_save(flags); + ASSERT(!local_irq_is_enabled()); + cnt = hpet_read32(HPET_COUNTER); cmp = cnt + delta; hpet_write32(cmp, HPET_Tn_CMP(ch->idx)); cmp = hpet_read32(HPET_COUNTER); - local_irq_restore(flags); /* Are we within two ticks of the deadline passing? Then we may miss. */ return ((cmp + 2 - cnt) > delta) ? -ETIME : 0; -- 1.7.10.4
Andrew Cooper
2013-Nov-13 17:59 UTC
[Patch v4 4/5] x86/hpet: Debug and verbose hpet logging
This was for debugging purposes, but might perhaps be more useful generally. I am happy to keep none, some or all of it, depending on how useful people think it might be. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> --- xen/arch/x86/hpet.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 83 insertions(+) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index 0bb3c31..dedfa04 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -66,6 +66,36 @@ u8 __initdata hpet_blockid; static bool_t __initdata force_hpet_broadcast; boolean_param("hpetbroadcast", force_hpet_broadcast); +static bool_t __read_mostly hpet_verbose; +static bool_t __read_mostly hpet_debug; +static void __init parse_hpet_param(char * s) +{ + char *ss; + int val; + + do { + val = !!strncmp(s, "no-", 3); + if ( !val ) + s += 3; + + ss = strchr(s, '',''); + if ( ss ) + *ss = ''\0''; + + if ( !strcmp(s, "verbose") ) + hpet_verbose = val; + else if ( !strcmp(s, "debug") ) + { + hpet_debug = val; + if ( val ) + hpet_verbose = 1; + } + + s = ss + 1; + } while ( ss ); +} +custom_param("hpet", parse_hpet_param); + /* * Calculate a multiplication factor for scaled math, which is used to convert * nanoseconds based values to clock ticks: @@ -99,6 +129,35 @@ static inline unsigned long ns2ticks(unsigned long nsec, int shift, return (unsigned long) tmp; } +static void dump_hpet_timer(unsigned timer) +{ + u32 cfg = hpet_read32(HPET_Tn_CFG(timer)); + + printk(XENLOG_INFO "HPET: Timer %02u CFG: raw 0x%08"PRIx32 + " Caps: %d %c%c", timer, cfg, + cfg & HPET_TN_64BIT_CAP ? 64 : 32, + cfg & HPET_TN_FSB_CAP ? ''M'' : ''-'', + cfg & HPET_TN_PERIODIC_CAP ? ''P'' : ''-''); + + printk("\n Setup: "); + + if ( (cfg & HPET_TN_FSB_CAP) && (cfg & HPET_TN_FSB) ) + printk("FSB "); + + if ( !(cfg & HPET_TN_FSB) ) + printk("GSI %#x ", + (cfg & HPET_TN_ROUTE) >> HPET_TN_ROUTE_SHIFT); + + if ( cfg & HPET_TN_32BIT ) + printk("32bit "); + + if ( cfg & HPET_TN_PERIODIC ) + printk("Periodic "); + + printk("%sabled ", cfg & HPET_TN_ENABLE ? "En" : "Dis"); + printk("%s\n", cfg & HPET_TN_LEVEL ? "Level" : "Edge"); +} + /* * Program an HPET channels counter relative to now. ''delta'' is specified in * ticks, and should be calculated with ns2ticks(). The channel lock should @@ -712,7 +771,14 @@ u64 __init hpet_setup(void) unsigned int last; if ( hpet_rate ) + { + if ( hpet_debug ) + printk(XENLOG_DEBUG "HPET: Skipping re-setup\n"); return hpet_rate; + } + + if ( hpet_debug ) + printk(XENLOG_DEBUG "HPET: Setting up hpet data\n"); if ( hpet_address == 0 ) return 0; @@ -726,6 +792,20 @@ u64 __init hpet_setup(void) return 0; } + if ( hpet_verbose ) + { + printk(XENLOG_INFO "HPET: Vendor: %04"PRIx16", Rev: %u, %u timers\n", + hpet_id >> HPET_ID_VENDOR_SHIFT, + hpet_id & HPET_ID_REV, + ((hpet_id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1); + printk(XENLOG_INFO "HPET: Caps: "); + if ( hpet_id & HPET_ID_LEGSUP ) + printk("Legacy "); + if ( hpet_id & HPET_ID_64BIT ) + printk("64bit "); + printk("\n"); + } + /* Check for sane period (100ps <= period <= 100ns). */ hpet_period = hpet_read32(HPET_PERIOD); if ( (hpet_period > 100000000) || (hpet_period < 100000) ) @@ -783,6 +863,9 @@ void hpet_resume(u32 *boot_cfg) cfg &= ~HPET_TN_RESERVED; } hpet_write32(cfg, HPET_Tn_CFG(i)); + + if ( hpet_verbose ) + dump_hpet_timer(i); } cfg = hpet_read32(HPET_CFG); -- 1.7.10.4
Debug key for dumping HPET state. This patch is not intended for committing. --- xen/arch/x86/hpet.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index dedfa04..781f0f4 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -9,6 +9,7 @@ #include <xen/cpuidle.h> #include <xen/errno.h> #include <xen/softirq.h> +#include <xen/keyhandler.h> #include <mach_apic.h> @@ -68,6 +69,7 @@ boolean_param("hpetbroadcast", force_hpet_broadcast); static bool_t __read_mostly hpet_verbose; static bool_t __read_mostly hpet_debug; +static bool_t __initdata hpet_debug_tick; static void __init parse_hpet_param(char * s) { char *ss; @@ -90,6 +92,8 @@ static void __init parse_hpet_param(char * s) if ( val ) hpet_verbose = 1; } + else if ( !strcmp(s, "tick") ) + hpet_debug_tick = val; s = ss + 1; } while ( ss ); @@ -764,6 +768,33 @@ int hpet_legacy_irq_tick(void) static u32 *hpet_boot_cfg; +static void do_hpet_dump_state(unsigned char key) +{ + unsigned i; + printk("''%c'' pressed - dumping HPET state\n", key); + + for ( i = 0; i < num_hpets_used; ++i ) + dump_hpet_timer(i); +} + +static struct keyhandler hpet_dump_state = { + .irq_callback = 0, + .u.fn = do_hpet_dump_state, + .desc = "Dump hpet state" +}; + +static struct timer hpet_dbg_tick; +static void hpet_dbg_tick_fn(void *data) +{ + static s_time_t last = 0; + s_time_t now = NOW(); + + printk("In HPET debug tick. Time is %"PRId64", delta is %"PRId64"\n", + now, now - last); + set_timer(&hpet_dbg_tick, now + SECONDS(5)); + last = now; +} + u64 __init hpet_setup(void) { static u64 __initdata hpet_rate; @@ -821,6 +852,14 @@ u64 __init hpet_setup(void) hpet_rate = 1000000000000000ULL; /* 10^15 */ (void)do_div(hpet_rate, hpet_period); + register_keyhandler(''1'', &hpet_dump_state); + + if ( hpet_debug_tick ) + { + init_timer(&hpet_dbg_tick, hpet_dbg_tick_fn, NULL, 0); + set_timer(&hpet_dbg_tick, NOW() + SECONDS(5)); + } + return hpet_rate; } -- 1.7.10.4
Tim Deegan
2013-Nov-14 15:52 UTC
Re: [Patch v4 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
Hi, This looks pretty good to me. At 17:59 +0000 on 13 Nov (1384361951), Andrew Cooper wrote:> + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) > + { > + /* > + * Else we want to reprogram the fallback HPET sooner if possible, > + * but with all the spinlocking, it just might have passed. > + */ > + ch = &hpet_events[fallback_idx]; > + > + spin_lock(&ch->lock); > > - if ( !(ch->flags & HPET_EVT_LEGACY) ) > - hpet_attach_channel(cpu, ch); > + if ( ch->cpu != 1 && ch->next_event == fallback_deadline )DYM ''!= -1'' here? Apart from that, you can put ''Reviewed-by: Tim Deegan <tim@xen.org>'' on patches 1-3. Cheers, Tim.
Andrew Cooper
2013-Nov-14 15:56 UTC
Re: [Patch v4 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 14/11/13 15:52, Tim Deegan wrote:> Hi, > > This looks pretty good to me. > > At 17:59 +0000 on 13 Nov (1384361951), Andrew Cooper wrote: >> + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) >> + { >> + /* >> + * Else we want to reprogram the fallback HPET sooner if possible, >> + * but with all the spinlocking, it just might have passed. >> + */ >> + ch = &hpet_events[fallback_idx]; >> + >> + spin_lock(&ch->lock); >> >> - if ( !(ch->flags & HPET_EVT_LEGACY) ) >> - hpet_attach_channel(cpu, ch); >> + if ( ch->cpu != 1 && ch->next_event == fallback_deadline ) > DYM ''!= -1'' here? > > Apart from that, you can put ''Reviewed-by: Tim Deegan <tim@xen.org>'' > on patches 1-3. > > Cheers, > > Tim. >Eek - I do indeed. (This is a very rare codepath, even when forced using no-arat on newer hardware). ~Andrew
Andrew Cooper
2013-Nov-14 16:01 UTC
[Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
This involves rewriting most of the MSI related HPET code, and as a result this patch looks very complicated. It is probably best viewed as an end result, with the following notes explaining what is going on. The new logic is as follows: * A single high priority vector is allocated and uses on all cpus. * Reliance on the irq infrastructure is completely removed. * Tracking of free hpet channels has changed. It is now an individual bitmap, and allocation is based on winning a test_and_clear_bit() operation. * There is a notion of strict ownership of hpet channels. ** A cpu which owns an HPET channel can program it for a desired deadline. ** A cpu which can''t find a free HPET channel will have to share. ** If an HPET firing at an appropriate time can be found, a CPU will simply request to be woken up with that HPET. ** If all HPETs are firing too far into the future, a CPU may request to be woken up and reprogram that channel earlier. * Some functions have been renamed to be more descriptive. Some functions have parameters changed to be more consistent. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> --- xen/arch/x86/hpet.c | 537 +++++++++++++++++++++++---------------------------- 1 file changed, 237 insertions(+), 300 deletions(-) diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c index fd44582..f1638a9 100644 --- a/xen/arch/x86/hpet.c +++ b/xen/arch/x86/hpet.c @@ -4,26 +4,21 @@ * HPET management. */ -#include <xen/config.h> +#include <xen/lib.h> +#include <xen/init.h> +#include <xen/cpuidle.h> #include <xen/errno.h> -#include <xen/time.h> -#include <xen/timer.h> -#include <xen/smp.h> #include <xen/softirq.h> -#include <xen/irq.h> -#include <xen/numa.h> + +#include <mach_apic.h> + #include <asm/fixmap.h> #include <asm/div64.h> #include <asm/hpet.h> -#include <asm/msi.h> -#include <mach_apic.h> -#include <xen/cpuidle.h> #define MAX_DELTA_NS MILLISECS(10*1000) #define MIN_DELTA_NS MICROSECS(20) -#define HPET_EVT_USED_BIT 0 -#define HPET_EVT_USED (1 << HPET_EVT_USED_BIT) #define HPET_EVT_DISABLE_BIT 1 #define HPET_EVT_DISABLE (1 << HPET_EVT_DISABLE_BIT) #define HPET_EVT_LEGACY_BIT 2 @@ -36,8 +31,6 @@ struct hpet_event_channel s_time_t next_event; cpumask_var_t cpumask; spinlock_t lock; - void (*event_handler)(struct hpet_event_channel *); - unsigned int idx; /* physical channel idx */ unsigned int cpu; /* msi target */ struct msi_desc msi;/* msi state */ @@ -48,8 +41,20 @@ static struct hpet_event_channel *__read_mostly hpet_events; /* msi hpet channels used for broadcast */ static unsigned int __read_mostly num_hpets_used; -DEFINE_PER_CPU(struct hpet_event_channel *, cpu_bc_channel); +/* High-priority vector for HPET interrupts */ +static u8 __read_mostly hpet_vector; +/* + * HPET channel used for idling. Either the HPET channel this cpu owns + * (indicated by channel->cpu pointing back), or the HPET channel belonging to + * another cpu with which we have requested to be woken. + */ +static DEFINE_PER_CPU(struct hpet_event_channel *, hpet_channel); + +/* Bitmask of currently-free HPET channels. */ +static uint32_t free_channels; + +/* Data from the HPET ACPI table */ unsigned long __initdata hpet_address; u8 __initdata hpet_blockid; @@ -161,89 +166,43 @@ static int hpet_program_time(struct hpet_event_channel *ch, return ret; } -static void evt_do_broadcast(cpumask_t *mask) +/* Wake up all cpus in the channel mask. Lock should be held. */ +static void hpet_wake_cpus(struct hpet_event_channel *ch) { - unsigned int cpu = smp_processor_id(); + cpuidle_wakeup_mwait(ch->cpumask); - if ( cpumask_test_and_clear_cpu(cpu, mask) ) - raise_softirq(TIMER_SOFTIRQ); - - cpuidle_wakeup_mwait(mask); - - if ( !cpumask_empty(mask) ) - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); + if ( !cpumask_empty(ch->cpumask) ) + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); } -static void handle_hpet_broadcast(struct hpet_event_channel *ch) +/* HPET interrupt handler. Wake all requested cpus. Lock should be held. */ +static void hpet_interrupt_handler(struct hpet_event_channel *ch) { - cpumask_t mask; - s_time_t now, next_event; - unsigned int cpu; - unsigned long flags; - - spin_lock_irqsave(&ch->lock, flags); - -again: - ch->next_event = STIME_MAX; - - spin_unlock_irqrestore(&ch->lock, flags); - - next_event = STIME_MAX; - cpumask_clear(&mask); - now = NOW(); - - /* find all expired events */ - for_each_cpu(cpu, ch->cpumask) - { - s_time_t deadline; - - rmb(); - deadline = per_cpu(timer_deadline, cpu); - rmb(); - if ( !cpumask_test_cpu(cpu, ch->cpumask) ) - continue; - - if ( deadline <= now ) - cpumask_set_cpu(cpu, &mask); - else if ( deadline < next_event ) - next_event = deadline; - } - - /* wakeup the cpus which have an expired event. */ - evt_do_broadcast(&mask); - - if ( next_event != STIME_MAX ) - { - spin_lock_irqsave(&ch->lock, flags); - - if ( next_event < ch->next_event && - hpet_program_time(ch, next_event, now, 0) ) - goto again; - - spin_unlock_irqrestore(&ch->lock, flags); - } + hpet_wake_cpus(ch); + raise_softirq(TIMER_SOFTIRQ); } -static void hpet_interrupt_handler(int irq, void *data, - struct cpu_user_regs *regs) +/* HPET interrupt entry. This is set up as a high priority vector. */ +static void do_hpet_irq(struct cpu_user_regs *regs) { - struct hpet_event_channel *ch = (struct hpet_event_channel *)data; - - this_cpu(irq_count)--; + unsigned int cpu = smp_processor_id(); + struct hpet_event_channel *ch = this_cpu(hpet_channel); - if ( !ch->event_handler ) + if ( ch ) { - printk(XENLOG_WARNING "Spurious HPET timer interrupt on HPET timer %d\n", ch->idx); - return; + spin_lock(&ch->lock); + if ( ch->cpu == cpu ) + hpet_interrupt_handler(ch); + spin_unlock(&ch->lock); } - ch->event_handler(ch); + ack_APIC_irq(); } -static void hpet_msi_unmask(struct irq_desc *desc) +/* Unmask an HPET MSI channel. Lock should be held */ +static void hpet_msi_unmask(struct hpet_event_channel *ch) { u32 cfg; - struct hpet_event_channel *ch = desc->action->dev_id; cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); cfg |= HPET_TN_ENABLE; @@ -251,10 +210,10 @@ static void hpet_msi_unmask(struct irq_desc *desc) ch->msi.msi_attrib.masked = 0; } -static void hpet_msi_mask(struct irq_desc *desc) +/* Mask an HPET MSI channel. Lock should be held */ +static void hpet_msi_mask(struct hpet_event_channel *ch) { u32 cfg; - struct hpet_event_channel *ch = desc->action->dev_id; cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); cfg &= ~HPET_TN_ENABLE; @@ -262,92 +221,36 @@ static void hpet_msi_mask(struct irq_desc *desc) ch->msi.msi_attrib.masked = 1; } -static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg) +/* + * Set up the MSI for an HPET channel to point at the allocated cpu, including + * interrupt remapping entries when appropriate. The channel lock is expected + * to be held, and the MSI must currently be masked. + */ +static int hpet_setup_msi(struct hpet_event_channel *ch) { - ch->msi.msg = *msg; + ASSERT(ch->cpu != -1); + ASSERT(ch->msi.msi_attrib.masked == 1); + + msi_compose_msg(hpet_vector, cpumask_of(ch->cpu), &ch->msi.msg); if ( iommu_intremap ) { - int rc = iommu_update_ire_from_msi(&ch->msi, msg); + int rc = iommu_update_ire_from_msi(&ch->msi, &ch->msi.msg); if ( rc ) return rc; } - hpet_write32(msg->data, HPET_Tn_ROUTE(ch->idx)); - hpet_write32(msg->address_lo, HPET_Tn_ROUTE(ch->idx) + 4); + hpet_write32(ch->msi.msg.data, HPET_Tn_ROUTE(ch->idx)); + hpet_write32(ch->msi.msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4); return 0; } -static void __maybe_unused -hpet_msi_read(struct hpet_event_channel *ch, struct msi_msg *msg) -{ - msg->data = hpet_read32(HPET_Tn_ROUTE(ch->idx)); - msg->address_lo = hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4); - msg->address_hi = MSI_ADDR_BASE_HI; - if ( iommu_intremap ) - iommu_read_msi_from_ire(&ch->msi, msg); -} - -static unsigned int hpet_msi_startup(struct irq_desc *desc) -{ - hpet_msi_unmask(desc); - return 0; -} - -#define hpet_msi_shutdown hpet_msi_mask - -static void hpet_msi_ack(struct irq_desc *desc) -{ - irq_complete_move(desc); - move_native_irq(desc); - ack_APIC_irq(); -} - -static void hpet_msi_set_affinity(struct irq_desc *desc, const cpumask_t *mask) -{ - struct hpet_event_channel *ch = desc->action->dev_id; - struct msi_msg msg = ch->msi.msg; - - msg.dest32 = set_desc_affinity(desc, mask); - if ( msg.dest32 == BAD_APICID ) - return; - - msg.data &= ~MSI_DATA_VECTOR_MASK; - msg.data |= MSI_DATA_VECTOR(desc->arch.vector); - msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; - msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32); - if ( msg.data != ch->msi.msg.data || msg.dest32 != ch->msi.msg.dest32 ) - hpet_msi_write(ch, &msg); -} - -/* - * IRQ Chip for MSI HPET Devices, - */ -static hw_irq_controller hpet_msi_type = { - .typename = "HPET-MSI", - .startup = hpet_msi_startup, - .shutdown = hpet_msi_shutdown, - .enable = hpet_msi_unmask, - .disable = hpet_msi_mask, - .ack = hpet_msi_ack, - .set_affinity = hpet_msi_set_affinity, -}; - -static int __hpet_setup_msi_irq(struct irq_desc *desc) -{ - struct msi_msg msg; - - msi_compose_msg(desc->arch.vector, desc->arch.cpu_mask, &msg); - return hpet_msi_write(desc->action->dev_id, &msg); -} - -static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) +static int __init hpet_init_msi(struct hpet_event_channel *ch) { int ret; u32 cfg = hpet_read32(HPET_Tn_CFG(ch->idx)); - irq_desc_t *desc = irq_to_desc(ch->msi.irq); if ( iommu_intremap ) { @@ -358,14 +261,12 @@ static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) } /* set HPET Tn as oneshot */ - cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC); + cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC | HPET_TN_ENABLE); cfg |= HPET_TN_FSB | HPET_TN_32BIT; hpet_write32(cfg, HPET_Tn_CFG(ch->idx)); + ch->msi.msi_attrib.masked = 1; - desc->handler = &hpet_msi_type; - ret = request_irq(ch->msi.irq, hpet_interrupt_handler, "HPET", ch); - if ( ret >= 0 ) - ret = __hpet_setup_msi_irq(desc); + ret = hpet_setup_msi(ch); if ( ret < 0 ) { if ( iommu_intremap ) @@ -373,25 +274,6 @@ static int __init hpet_setup_msi_irq(struct hpet_event_channel *ch) return ret; } - desc->msi_desc = &ch->msi; - - return 0; -} - -static int __init hpet_assign_irq(struct hpet_event_channel *ch) -{ - int irq; - - if ( (irq = create_irq(NUMA_NO_NODE)) < 0 ) - return irq; - - ch->msi.irq = irq; - if ( hpet_setup_msi_irq(ch) ) - { - destroy_irq(irq); - return -EINVAL; - } - return 0; } @@ -412,6 +294,8 @@ static void __init hpet_fsb_cap_lookup(void) if ( !hpet_events ) return; + alloc_direct_apic_vector(&hpet_vector, do_hpet_irq); + for ( i = 0; i < num_chs && num_hpets_used < nr_cpu_ids; i++ ) { struct hpet_event_channel *ch = &hpet_events[num_hpets_used]; @@ -434,7 +318,7 @@ static void __init hpet_fsb_cap_lookup(void) ch->flags = 0; ch->idx = i; - if ( hpet_assign_irq(ch) == 0 ) + if ( hpet_init_msi(ch) == 0 ) num_hpets_used++; } @@ -442,102 +326,28 @@ static void __init hpet_fsb_cap_lookup(void) num_hpets_used, num_chs); } -static struct hpet_event_channel *hpet_get_channel(unsigned int cpu) +/* + * Search for, and allocate, a free HPET channel. Returns a pointer to the + * channel, or NULL in the case that none were free. The caller is + * responsible for returning the channel to the free pool. + */ +static struct hpet_event_channel *hpet_get_free_channel(void) { - static unsigned int next_channel; - unsigned int i, next; - struct hpet_event_channel *ch; - - if ( num_hpets_used == 0 ) - return hpet_events; + unsigned ch, tries; - if ( num_hpets_used >= nr_cpu_ids ) - return &hpet_events[cpu]; - - do { - next = next_channel; - if ( (i = next + 1) == num_hpets_used ) - i = 0; - } while ( cmpxchg(&next_channel, next, i) != next ); - - /* try unused channel first */ - for ( i = next; i < next + num_hpets_used; i++ ) + for ( tries = num_hpets_used; tries; --tries ) { - ch = &hpet_events[i % num_hpets_used]; - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - { - ch->cpu = cpu; - return ch; - } - } - - /* share a in-use channel */ - ch = &hpet_events[next]; - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - ch->cpu = cpu; - - return ch; -} - -static void set_channel_irq_affinity(struct hpet_event_channel *ch) -{ - struct irq_desc *desc = irq_to_desc(ch->msi.irq); - - ASSERT(!local_irq_is_enabled()); - spin_lock(&desc->lock); - hpet_msi_mask(desc); - hpet_msi_set_affinity(desc, cpumask_of(ch->cpu)); - hpet_msi_unmask(desc); - spin_unlock(&desc->lock); - - spin_unlock(&ch->lock); - - /* We may have missed an interrupt due to the temporary masking. */ - if ( ch->event_handler && ch->next_event < NOW() ) - ch->event_handler(ch); -} - -static void hpet_attach_channel(unsigned int cpu, - struct hpet_event_channel *ch) -{ - ASSERT(!local_irq_is_enabled()); - spin_lock(&ch->lock); - - per_cpu(cpu_bc_channel, cpu) = ch; - - /* try to be the channel owner again while holding the lock */ - if ( !test_and_set_bit(HPET_EVT_USED_BIT, &ch->flags) ) - ch->cpu = cpu; - - if ( ch->cpu != cpu ) - spin_unlock(&ch->lock); - else - set_channel_irq_affinity(ch); -} - -static void hpet_detach_channel(unsigned int cpu, - struct hpet_event_channel *ch) -{ - spin_lock_irq(&ch->lock); - - ASSERT(ch == per_cpu(cpu_bc_channel, cpu)); + if ( (ch = ffs(free_channels)) == 0 ) + break; - per_cpu(cpu_bc_channel, cpu) = NULL; + --ch; + ASSERT(ch < num_hpets_used); - if ( cpu != ch->cpu ) - spin_unlock_irq(&ch->lock); - else if ( cpumask_empty(ch->cpumask) ) - { - ch->cpu = -1; - clear_bit(HPET_EVT_USED_BIT, &ch->flags); - spin_unlock_irq(&ch->lock); - } - else - { - ch->cpu = cpumask_first(ch->cpumask); - set_channel_irq_affinity(ch); - local_irq_enable(); + if ( test_and_clear_bit(ch, &free_channels) ) + return &hpet_events[ch]; } + + return NULL; } #include <asm/mc146818rtc.h> @@ -574,6 +384,7 @@ void __init hpet_broadcast_init(void) /* Stop HPET legacy interrupts */ cfg &= ~HPET_CFG_LEGACY; n = num_hpets_used; + free_channels = (1U << n) - 1; } else { @@ -585,7 +396,6 @@ void __init hpet_broadcast_init(void) hpet_events = xzalloc(struct hpet_event_channel); if ( !hpet_events || !zalloc_cpumask_var(&hpet_events->cpumask) ) return; - hpet_events->msi.irq = -1; /* Start HPET legacy interrupts */ cfg |= HPET_CFG_LEGACY; @@ -619,9 +429,8 @@ void __init hpet_broadcast_init(void) hpet_events[i].shift = 32; hpet_events[i].next_event = STIME_MAX; spin_lock_init(&hpet_events[i].lock); - wmb(); - hpet_events[i].event_handler = handle_hpet_broadcast; + hpet_events[i].msi.irq = -1; hpet_events[i].msi.msi_attrib.maskbit = 1; hpet_events[i].msi.msi_attrib.pos = MSI_TYPE_HPET; } @@ -658,15 +467,24 @@ void hpet_broadcast_resume(void) for ( i = 0; i < n; i++ ) { - if ( hpet_events[i].msi.irq >= 0 ) - __hpet_setup_msi_irq(irq_to_desc(hpet_events[i].msi.irq)); - /* set HPET Tn as oneshot */ cfg = hpet_read32(HPET_Tn_CFG(hpet_events[i].idx)); cfg &= ~(HPET_TN_LEVEL | HPET_TN_PERIODIC); - cfg |= HPET_TN_ENABLE | HPET_TN_32BIT; - if ( !(hpet_events[i].flags & HPET_EVT_LEGACY) ) + cfg |= HPET_TN_32BIT; + + /* + * Legacy HPET channel enabled here. MSI channels enabled in + * hpet_broadcast_init() when claimed by a cpu. + */ + if ( hpet_events[i].flags & HPET_EVT_LEGACY ) + cfg |= HPET_TN_ENABLE; + else + { + cfg &= ~HPET_TN_ENABLE; cfg |= HPET_TN_FSB; + hpet_events[i].msi.msi_attrib.masked = 1; + } + hpet_write32(cfg, HPET_Tn_CFG(hpet_events[i].idx)); hpet_events[i].next_event = STIME_MAX; @@ -703,50 +521,162 @@ void hpet_disable_legacy_broadcast(void) void hpet_broadcast_enter(void) { unsigned int cpu = smp_processor_id(); - struct hpet_event_channel *ch = per_cpu(cpu_bc_channel, cpu); + struct hpet_event_channel *ch = this_cpu(hpet_channel); + s_time_t deadline = this_cpu(timer_deadline); + + ASSERT(!local_irq_is_enabled()); + ASSERT(ch == NULL); - if ( per_cpu(timer_deadline, cpu) == 0 ) + if ( deadline == 0 ) return; - if ( !ch ) - ch = hpet_get_channel(cpu); + /* If using HPET in legacy timer mode */ + if ( num_hpets_used == 0 ) + { + spin_lock(&hpet_events->lock); - ASSERT(!local_irq_is_enabled()); + cpumask_set_cpu(cpu, hpet_events->cpumask); + if ( deadline < hpet_events->next_event ) + hpet_program_time(hpet_events, deadline, NOW(), 1); + + spin_unlock(&hpet_events->lock); + return; + } + + ch = hpet_get_free_channel(); + + if ( ch ) + { + spin_lock(&ch->lock); + + /* This really should be an MSI channel by this point */ + ASSERT(!(ch->flags & HPET_EVT_LEGACY)); + + hpet_msi_mask(ch); + + this_cpu(hpet_channel) = ch; + ch->cpu = cpu; + cpumask_set_cpu(cpu, ch->cpumask); + + hpet_setup_msi(ch); + hpet_program_time(ch, deadline, NOW(), 1); + hpet_msi_unmask(ch); + + spin_unlock(&ch->lock); + } + else + { + /* TODO - this seems very ugly */ + s_time_t fallback_deadline = STIME_MAX; + unsigned int i, fallback_idx = -1; + + for ( i = 0; i < num_hpets_used; ++i ) + { + ch = &hpet_events[i]; + spin_lock(&ch->lock); + + if ( ch->cpu == -1 ) + goto continue_search; + + /* This channel is going to expire far too early */ + if ( ch->next_event < deadline - MICROSECS(50) ) + goto continue_search; + + /* We can deal with being woken with 50us to spare */ + if ( ch->next_event <= deadline ) + break; + + /* Otherwise record our best HPET to borrow. */ + if ( ch->next_event <= fallback_deadline ) + { + fallback_idx = i; + fallback_deadline = ch->next_event; + } + + continue_search: + spin_unlock(&ch->lock); + ch = NULL; + } + + if ( ch ) + { + /* Found HPET with an appropriate time. Request to be woken up */ + cpumask_set_cpu(cpu, ch->cpumask); + this_cpu(hpet_channel) = ch; + spin_unlock(&ch->lock); + } + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) + { + /* + * Else we want to reprogram the fallback HPET sooner if possible, + * but with all the spinlocking, it just might have passed. + */ + ch = &hpet_events[fallback_idx]; + + spin_lock(&ch->lock); - if ( !(ch->flags & HPET_EVT_LEGACY) ) - hpet_attach_channel(cpu, ch); + if ( ch->cpu != -1 && ch->next_event == fallback_deadline ) + { + cpumask_set_cpu(cpu, ch->cpumask); + hpet_program_time(ch, deadline, NOW(), 1); + } + else + /* All else has failed. Wander the idle loop again */ + this_cpu(timer_deadline) = NOW() - 1; + + spin_unlock(&ch->lock); + } + else + /* All HPETs were too soon. Wander the idle loop again */ + this_cpu(timer_deadline) = NOW() - 1; + } /* Disable LAPIC timer interrupts. */ disable_APIC_timer(); - cpumask_set_cpu(cpu, ch->cpumask); - - spin_lock(&ch->lock); - /* reprogram if current cpu expire time is nearer */ - if ( per_cpu(timer_deadline, cpu) < ch->next_event ) - hpet_program_time(ch, per_cpu(timer_deadline, cpu), NOW(), 1); - spin_unlock(&ch->lock); } void hpet_broadcast_exit(void) { unsigned int cpu = smp_processor_id(); - struct hpet_event_channel *ch = per_cpu(cpu_bc_channel, cpu); + struct hpet_event_channel *ch = this_cpu(hpet_channel); - if ( per_cpu(timer_deadline, cpu) == 0 ) + ASSERT(local_irq_is_enabled()); + + if ( this_cpu(timer_deadline) == 0 ) + return; + + /* If using HPET in legacy timer mode */ + if ( num_hpets_used == 0 ) + { + /* This is safe without the spinlock, and will reduce contention. */ + cpumask_clear_cpu(cpu, hpet_events->cpumask); return; + } if ( !ch ) - ch = hpet_get_channel(cpu); + return; - /* Reprogram the deadline; trigger timer work now if it has passed. */ - enable_APIC_timer(); - if ( !reprogram_timer(per_cpu(timer_deadline, cpu)) ) - raise_softirq(TIMER_SOFTIRQ); + spin_lock_irq(&ch->lock); cpumask_clear_cpu(cpu, ch->cpumask); - if ( !(ch->flags & HPET_EVT_LEGACY) ) - hpet_detach_channel(cpu, ch); + /* If we own the channel, detach it */ + if ( ch->cpu == cpu ) + { + hpet_msi_mask(ch); + hpet_wake_cpus(ch); + ch->cpu = -1; + set_bit(ch->idx, &free_channels); + } + + this_cpu(hpet_channel) = NULL; + + spin_unlock_irq(&ch->lock); + + /* Reprogram the deadline; trigger timer work now if it has passed. */ + enable_APIC_timer(); + if ( !reprogram_timer(this_cpu(timer_deadline)) ) + raise_softirq(TIMER_SOFTIRQ); } int hpet_broadcast_is_available(void) @@ -763,7 +693,14 @@ int hpet_legacy_irq_tick(void) (hpet_events->flags & (HPET_EVT_DISABLE|HPET_EVT_LEGACY)) ! HPET_EVT_LEGACY ) return 0; - hpet_events->event_handler(hpet_events); + + spin_lock_irq(&hpet_events->lock); + + hpet_interrupt_handler(hpet_events); + hpet_events->next_event = STIME_MAX; + + spin_unlock_irq(&hpet_events->lock); + return 1; } -- 1.7.10.4
Jan Beulich
2013-Nov-22 15:45 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > The new logic is as follows: > * A single high priority vector is allocated and uses on all cpus.Does this really need to be a high priority one? I''d think we''d be fine with the lowest priority one we can get, as we only need the wakeup here if nothing else gets a CPU to wake up.> +/* Wake up all cpus in the channel mask. Lock should be held. */ > +static void hpet_wake_cpus(struct hpet_event_channel *ch) > { > - unsigned int cpu = smp_processor_id(); > + cpuidle_wakeup_mwait(ch->cpumask); > > - if ( cpumask_test_and_clear_cpu(cpu, mask) ) > - raise_softirq(TIMER_SOFTIRQ); > - > - cpuidle_wakeup_mwait(mask); > - > - if ( !cpumask_empty(mask) ) > - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); > + if ( !cpumask_empty(ch->cpumask) ) > + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ);Looks like the cpumask_empty() check isn''t really needed here?> +/* HPET interrupt entry. This is set up as a high priority vector. */ > +static void do_hpet_irq(struct cpu_user_regs *regs) > { > - struct hpet_event_channel *ch = (struct hpet_event_channel *)data; > - > - this_cpu(irq_count)--; > + unsigned int cpu = smp_processor_id();This is being used just once, and hence rather pointless to have.> - desc->handler = &hpet_msi_type; > - ret = request_irq(ch->msi.irq, hpet_interrupt_handler, "HPET", ch); > - if ( ret >= 0 ) > - ret = __hpet_setup_msi_irq(desc); > + ret = hpet_setup_msi(ch);Why did you keep this? With that function now being called from hpet_broadcast_enter() I don''t why you''d need this here (just like you''re removing it from hpet_broadcast_resume() without replacement).> @@ -574,6 +384,7 @@ void __init hpet_broadcast_init(void) > /* Stop HPET legacy interrupts */ > cfg &= ~HPET_CFG_LEGACY; > n = num_hpets_used; > + free_channels = (1U << n) - 1;This is undefined when n == 32. Since n > 0, I''d suggest free_channels = (u32)~0 >> (32 - n);> + ch = hpet_get_free_channel(); > + > + if ( ch ) > + { > + spin_lock(&ch->lock); > + > + /* This really should be an MSI channel by this point */ > + ASSERT(!(ch->flags & HPET_EVT_LEGACY)); > + > + hpet_msi_mask(ch); > + > + this_cpu(hpet_channel) = ch; > + ch->cpu = cpu;I think even if benign, you should set ->cpu before storing into hpet_channel.> + cpumask_set_cpu(cpu, ch->cpumask); > + > + hpet_setup_msi(ch); > + hpet_program_time(ch, deadline, NOW(), 1); > + hpet_msi_unmask(ch); > + > + spin_unlock(&ch->lock); > + } > + else > + { > + /* TODO - this seems very ugly */And you nevertheless want it committed this way?> + s_time_t fallback_deadline = STIME_MAX; > + unsigned int i, fallback_idx = -1; > + > + for ( i = 0; i < num_hpets_used; ++i ) > + { > + ch = &hpet_events[i]; > + spin_lock(&ch->lock); > + > + if ( ch->cpu == -1 ) > + goto continue_search; > + > + /* This channel is going to expire far too early */ > + if ( ch->next_event < deadline - MICROSECS(50) ) > + goto continue_search;So this is going to make us wake early. You remember that all this code exists solely for power saving purposes? Iiuc the CPU will then basically spin entering an idle state, until its actual wakeup time is reached.> + > + /* We can deal with being woken with 50us to spare */ > + if ( ch->next_event <= deadline ) > + break; > + > + /* Otherwise record our best HPET to borrow. */ > + if ( ch->next_event <= fallback_deadline ) > + { > + fallback_idx = i; > + fallback_deadline = ch->next_event; > + } > + > + continue_search: > + spin_unlock(&ch->lock); > + ch = NULL; > + } > + > + if ( ch ) > + { > + /* Found HPET with an appropriate time. Request to be woken up */ > + cpumask_set_cpu(cpu, ch->cpumask); > + this_cpu(hpet_channel) = ch; > + spin_unlock(&ch->lock); > + } > + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) > + { > + /* > + * Else we want to reprogram the fallback HPET sooner if possible, > + * but with all the spinlocking, it just might have passed. > + */ > + ch = &hpet_events[fallback_idx]; > + > + spin_lock(&ch->lock); > > - if ( !(ch->flags & HPET_EVT_LEGACY) ) > - hpet_attach_channel(cpu, ch); > + if ( ch->cpu != -1 && ch->next_event == fallback_deadline )Can''t this be "<= deadline", being quite a bit more flexible?> + { > + cpumask_set_cpu(cpu, ch->cpumask); > + hpet_program_time(ch, deadline, NOW(), 1); > + } > + else > + /* All else has failed. Wander the idle loop again */ > + this_cpu(timer_deadline) = NOW() - 1; > + > + spin_unlock(&ch->lock); > + } > + else > + /* All HPETs were too soon. Wander the idle loop again */ > + this_cpu(timer_deadline) = NOW() - 1;Here and above - what good will this do? Is this just in the expectation that the next time round a free channel will likely be found? If so, why can''t you just go back to the start of the function. Further - how do you see this going back to the idle loop? The way this is called from acpi/cpu_idle.c, you''ll end up entering C3 anyway, with nothing going to wake you up. By proceeding to the end of the function, you even manage to disable the LAPIC timer.> + /* If we own the channel, detach it */ > + if ( ch->cpu == cpu ) > + { > + hpet_msi_mask(ch); > + hpet_wake_cpus(ch); > + ch->cpu = -1; > + set_bit(ch->idx, &free_channels);Wouldn''t there be less cache line bouncing if the "free" flag was just one of the channel flags? Jan
Andrew Cooper
2013-Nov-22 16:23 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 22/11/13 15:45, Jan Beulich wrote:>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> The new logic is as follows: >> * A single high priority vector is allocated and uses on all cpus. > Does this really need to be a high priority one? I''d think we''d be > fine with the lowest priority one we can get, as we only need the > wakeup here if nothing else gets a CPU to wake up.Yes - absolutely. We cannot have an HPET interrupt lower priority than a guest line level interrupt. Another cpu could be registered with our HPET channel to be worken up, and we need to service them in a timely fashon.> >> +/* Wake up all cpus in the channel mask. Lock should be held. */ >> +static void hpet_wake_cpus(struct hpet_event_channel *ch) >> { >> - unsigned int cpu = smp_processor_id(); >> + cpuidle_wakeup_mwait(ch->cpumask); >> >> - if ( cpumask_test_and_clear_cpu(cpu, mask) ) >> - raise_softirq(TIMER_SOFTIRQ); >> - >> - cpuidle_wakeup_mwait(mask); >> - >> - if ( !cpumask_empty(mask) ) >> - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); >> + if ( !cpumask_empty(ch->cpumask) ) >> + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); > Looks like the cpumask_empty() check isn''t really needed here?cpuidle_wakeup_mwait(mask) will only wake cpus who have set their bit in cpuidle_mwait_flags. There might be cpus who have registered for waking up who have not yet set their bit in cpuidle_mwait_flags. Out of caution, I did by best to avoid playing with the guts of the timing code, as it seems quite fragile.> >> +/* HPET interrupt entry. This is set up as a high priority vector. */ >> +static void do_hpet_irq(struct cpu_user_regs *regs) >> { >> - struct hpet_event_channel *ch = (struct hpet_event_channel *)data; >> - >> - this_cpu(irq_count)--; >> + unsigned int cpu = smp_processor_id(); > This is being used just once, and hence rather pointless to have.Fair point - this was left over from a previous version of the function which did use cpu twice.> >> - desc->handler = &hpet_msi_type; >> - ret = request_irq(ch->msi.irq, hpet_interrupt_handler, "HPET", ch); >> - if ( ret >= 0 ) >> - ret = __hpet_setup_msi_irq(desc); >> + ret = hpet_setup_msi(ch); > Why did you keep this? With that function now being called from > hpet_broadcast_enter() I don''t why you''d need this here (just > like you''re removing it from hpet_broadcast_resume() without > replacement).Because I optimised functions in the wrong order to obviously spot this. As we leave the MSI masked, this call to setup can be dropped. I would however prefer not to manually inline hpet_setup_msi() into hpet_broadcast_enter(). The compiler can do that for me, and it saves making hpet_broadcast_enter() even more complicated.> >> @@ -574,6 +384,7 @@ void __init hpet_broadcast_init(void) >> /* Stop HPET legacy interrupts */ >> cfg &= ~HPET_CFG_LEGACY; >> n = num_hpets_used; >> + free_channels = (1U << n) - 1; > This is undefined when n == 32. Since n > 0, I''d suggest > > free_channels = (u32)~0 >> (32 - n);Ok> >> + ch = hpet_get_free_channel(); >> + >> + if ( ch ) >> + { >> + spin_lock(&ch->lock); >> + >> + /* This really should be an MSI channel by this point */ >> + ASSERT(!(ch->flags & HPET_EVT_LEGACY)); >> + >> + hpet_msi_mask(ch); >> + >> + this_cpu(hpet_channel) = ch; >> + ch->cpu = cpu; > I think even if benign, you should set ->cpu before storing into > hpet_channel.Ok> >> + cpumask_set_cpu(cpu, ch->cpumask); >> + >> + hpet_setup_msi(ch); >> + hpet_program_time(ch, deadline, NOW(), 1); >> + hpet_msi_unmask(ch); >> + >> + spin_unlock(&ch->lock); >> + } >> + else >> + { >> + /* TODO - this seems very ugly */ > And you nevertheless want it committed this way?Probably best the comment gets dropped.> >> + s_time_t fallback_deadline = STIME_MAX; >> + unsigned int i, fallback_idx = -1; >> + >> + for ( i = 0; i < num_hpets_used; ++i ) >> + { >> + ch = &hpet_events[i]; >> + spin_lock(&ch->lock); >> + >> + if ( ch->cpu == -1 ) >> + goto continue_search; >> + >> + /* This channel is going to expire far too early */ >> + if ( ch->next_event < deadline - MICROSECS(50) ) >> + goto continue_search; > So this is going to make us wake early. You remember that all this > code exists solely for power saving purposes? Iiuc the CPU will > then basically spin entering an idle state, until its actual wakeup > time is reached.What would you suggest instead? We dont really want to be waking up late.> >> + >> + /* We can deal with being woken with 50us to spare */ >> + if ( ch->next_event <= deadline ) >> + break; >> + >> + /* Otherwise record our best HPET to borrow. */ >> + if ( ch->next_event <= fallback_deadline ) >> + { >> + fallback_idx = i; >> + fallback_deadline = ch->next_event; >> + } >> + >> + continue_search: >> + spin_unlock(&ch->lock); >> + ch = NULL; >> + } >> + >> + if ( ch ) >> + { >> + /* Found HPET with an appropriate time. Request to be woken up */ >> + cpumask_set_cpu(cpu, ch->cpumask); >> + this_cpu(hpet_channel) = ch; >> + spin_unlock(&ch->lock); >> + } >> + else if ( fallback_deadline < STIME_MAX && fallback_deadline != -1 ) >> + { >> + /* >> + * Else we want to reprogram the fallback HPET sooner if possible, >> + * but with all the spinlocking, it just might have passed. >> + */ >> + ch = &hpet_events[fallback_idx]; >> + >> + spin_lock(&ch->lock); >> >> - if ( !(ch->flags & HPET_EVT_LEGACY) ) >> - hpet_attach_channel(cpu, ch); >> + if ( ch->cpu != -1 && ch->next_event == fallback_deadline ) > Can''t this be "<= deadline", being quite a bit more flexible?This is a test for whether ch is the one I identified as the best inside the loop. There should be sufficient flexibility inside the loop.> >> + { >> + cpumask_set_cpu(cpu, ch->cpumask); >> + hpet_program_time(ch, deadline, NOW(), 1); >> + } >> + else >> + /* All else has failed. Wander the idle loop again */ >> + this_cpu(timer_deadline) = NOW() - 1; >> + >> + spin_unlock(&ch->lock); >> + } >> + else >> + /* All HPETs were too soon. Wander the idle loop again */ >> + this_cpu(timer_deadline) = NOW() - 1; > Here and above - what good will this do? Is this just in the > expectation that the next time round a free channel will likely be > found? If so, why can''t you just go back to the start of the > function.Or a different HPET programmed to a different time which is now compatible with our wakeup timeframe. We cannot spin in this function, as we have interrupts disabled.> > Further - how do you see this going back to the idle loop? The > way this is called from acpi/cpu_idle.c, you''ll end up entering > C3 anyway, with nothing going to wake you up. By proceeding to > the end of the function, you even manage to disable the LAPIC > timer.Hmm - I will need to revisit this logic then.> >> + /* If we own the channel, detach it */ >> + if ( ch->cpu == cpu ) >> + { >> + hpet_msi_mask(ch); >> + hpet_wake_cpus(ch); >> + ch->cpu = -1; >> + set_bit(ch->idx, &free_channels); > Wouldn''t there be less cache line bouncing if the "free" flag was just > one of the channel flags? > > JanYes, but then finding a free channel would involve searching each hpet_channel rather than searching free_channels with ffs(). ~Andrew
Jan Beulich
2013-Nov-22 16:49 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 22/11/13 15:45, Jan Beulich wrote: >>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> +/* Wake up all cpus in the channel mask. Lock should be held. */ >>> +static void hpet_wake_cpus(struct hpet_event_channel *ch) >>> { >>> - unsigned int cpu = smp_processor_id(); >>> + cpuidle_wakeup_mwait(ch->cpumask); >>> >>> - if ( cpumask_test_and_clear_cpu(cpu, mask) ) >>> - raise_softirq(TIMER_SOFTIRQ); >>> - >>> - cpuidle_wakeup_mwait(mask); >>> - >>> - if ( !cpumask_empty(mask) ) >>> - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); >>> + if ( !cpumask_empty(ch->cpumask) ) >>> + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); >> Looks like the cpumask_empty() check isn''t really needed here? > > cpuidle_wakeup_mwait(mask) will only wake cpus who have set their bit in > cpuidle_mwait_flags. > > There might be cpus who have registered for waking up who have not yet > set their bit in cpuidle_mwait_flags. > > Out of caution, I did by best to avoid playing with the guts of the > timing code, as it seems quite fragile.Some misunderstanding? The cpumask_empty() check doesn''t guard the call to cpuidle_wakeup_mwait(), but the one to cpumask_raise_softirq() (which appears to be prepared to be called with an empty mask).> I would however prefer not to manually inline hpet_setup_msi() into > hpet_broadcast_enter(). The compiler can do that for me, and it saves > making hpet_broadcast_enter() even more complicated.Sure, and I wasn''t suggesting that.>>> + cpumask_set_cpu(cpu, ch->cpumask); >>> + >>> + hpet_setup_msi(ch); >>> + hpet_program_time(ch, deadline, NOW(), 1); >>> + hpet_msi_unmask(ch); >>> + >>> + spin_unlock(&ch->lock); >>> + } >>> + else >>> + { >>> + /* TODO - this seems very ugly */ >> And you nevertheless want it committed this way? > > Probably best the comment gets dropped.But is _does_ seem very ugly, so the comment is true.>>> + s_time_t fallback_deadline = STIME_MAX; >>> + unsigned int i, fallback_idx = -1; >>> + >>> + for ( i = 0; i < num_hpets_used; ++i ) >>> + { >>> + ch = &hpet_events[i]; >>> + spin_lock(&ch->lock); >>> + >>> + if ( ch->cpu == -1 ) >>> + goto continue_search; >>> + >>> + /* This channel is going to expire far too early */ >>> + if ( ch->next_event < deadline - MICROSECS(50) ) >>> + goto continue_search; >> So this is going to make us wake early. You remember that all this >> code exists solely for power saving purposes? Iiuc the CPU will >> then basically spin entering an idle state, until its actual wakeup >> time is reached. > > What would you suggest instead? We dont really want to be waking up late.In fact we''re always kind of waking up late, due to the processing time it takes to come back out of the C state. If someone heavily favors responsiveness (i.e. low wakeup latency) over power savings, (s)he should disallow the use of deep C states.>>> + if ( ch->cpu != -1 && ch->next_event == fallback_deadline ) >> Can''t this be "<= deadline", being quite a bit more flexible? > > This is a test for whether ch is the one I identified as the best inside > the loop. There should be sufficient flexibility inside the loop.The thing is that with the lock dropped, ch->next_event may have changed. But it would still suit us if <= deadline.>>> + else >>> + /* All else has failed. Wander the idle loop again */ >>> + this_cpu(timer_deadline) = NOW() - 1; >>> + >>> + spin_unlock(&ch->lock); >>> + } >>> + else >>> + /* All HPETs were too soon. Wander the idle loop again */ >>> + this_cpu(timer_deadline) = NOW() - 1; >> Here and above - what good will this do? Is this just in the >> expectation that the next time round a free channel will likely be >> found? If so, why can''t you just go back to the start of the >> function. > > Or a different HPET programmed to a different time which is now > compatible with our wakeup timeframe.Are there cases where the wakeup time gets moved backwards (i.e. less far in the future)? I can''t seem to immediately think of such a case.> We cannot spin in this function, as we have interrupts disabled.And I was suggesting this only if the failure condition would provide a positive sign of there some suitable resource having become available. Jan
Andrew Cooper
2013-Nov-22 17:38 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 22/11/13 16:49, Jan Beulich wrote:>>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 22/11/13 15:45, Jan Beulich wrote: >>>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> +/* Wake up all cpus in the channel mask. Lock should be held. */ >>>> +static void hpet_wake_cpus(struct hpet_event_channel *ch) >>>> { >>>> - unsigned int cpu = smp_processor_id(); >>>> + cpuidle_wakeup_mwait(ch->cpumask); >>>> >>>> - if ( cpumask_test_and_clear_cpu(cpu, mask) ) >>>> - raise_softirq(TIMER_SOFTIRQ); >>>> - >>>> - cpuidle_wakeup_mwait(mask); >>>> - >>>> - if ( !cpumask_empty(mask) ) >>>> - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); >>>> + if ( !cpumask_empty(ch->cpumask) ) >>>> + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); >>> Looks like the cpumask_empty() check isn''t really needed here? >> cpuidle_wakeup_mwait(mask) will only wake cpus who have set their bit in >> cpuidle_mwait_flags. >> >> There might be cpus who have registered for waking up who have not yet >> set their bit in cpuidle_mwait_flags. >> >> Out of caution, I did by best to avoid playing with the guts of the >> timing code, as it seems quite fragile. > Some misunderstanding? The cpumask_empty() check doesn''t > guard the call to cpuidle_wakeup_mwait(), but the one to > cpumask_raise_softirq() (which appears to be prepared to be > called with an empty mask).cpuidle_wakeup_mwait() clears bits out of mask. The result is the set of cpus who were not set in cpuidle_mwait_flags.> >> I would however prefer not to manually inline hpet_setup_msi() into >> hpet_broadcast_enter(). The compiler can do that for me, and it saves >> making hpet_broadcast_enter() even more complicated. > Sure, and I wasn''t suggesting that. > >>>> + cpumask_set_cpu(cpu, ch->cpumask); >>>> + >>>> + hpet_setup_msi(ch); >>>> + hpet_program_time(ch, deadline, NOW(), 1); >>>> + hpet_msi_unmask(ch); >>>> + >>>> + spin_unlock(&ch->lock); >>>> + } >>>> + else >>>> + { >>>> + /* TODO - this seems very ugly */ >>> And you nevertheless want it committed this way? >> Probably best the comment gets dropped. > But is _does_ seem very ugly, so the comment is true. > >>>> + s_time_t fallback_deadline = STIME_MAX; >>>> + unsigned int i, fallback_idx = -1; >>>> + >>>> + for ( i = 0; i < num_hpets_used; ++i ) >>>> + { >>>> + ch = &hpet_events[i]; >>>> + spin_lock(&ch->lock); >>>> + >>>> + if ( ch->cpu == -1 ) >>>> + goto continue_search; >>>> + >>>> + /* This channel is going to expire far too early */ >>>> + if ( ch->next_event < deadline - MICROSECS(50) ) >>>> + goto continue_search; >>> So this is going to make us wake early. You remember that all this >>> code exists solely for power saving purposes? Iiuc the CPU will >>> then basically spin entering an idle state, until its actual wakeup >>> time is reached. >> What would you suggest instead? We dont really want to be waking up late. > In fact we''re always kind of waking up late, due to the processing > time it takes to come back out of the C state. If someone heavily > favors responsiveness (i.e. low wakeup latency) over power > savings, (s)he should disallow the use of deep C states. > >>>> + if ( ch->cpu != -1 && ch->next_event == fallback_deadline ) >>> Can''t this be "<= deadline", being quite a bit more flexible? >> This is a test for whether ch is the one I identified as the best inside >> the loop. There should be sufficient flexibility inside the loop. > The thing is that with the lock dropped, ch->next_event may have > changed. But it would still suit us if <= deadline.Very true - I missed that.> >>>> + else >>>> + /* All else has failed. Wander the idle loop again */ >>>> + this_cpu(timer_deadline) = NOW() - 1; >>>> + >>>> + spin_unlock(&ch->lock); >>>> + } >>>> + else >>>> + /* All HPETs were too soon. Wander the idle loop again */ >>>> + this_cpu(timer_deadline) = NOW() - 1; >>> Here and above - what good will this do? Is this just in the >>> expectation that the next time round a free channel will likely be >>> found? If so, why can''t you just go back to the start of the >>> function. >> Or a different HPET programmed to a different time which is now >> compatible with our wakeup timeframe. > Are there cases where the wakeup time gets moved backwards > (i.e. less far in the future)? I can''t seem to immediately think of > such a case.The case I was expecting is an HPET about to fire and being relinquished by its owner. After an iteration of the idle loop it might be free to nab, or a different cpu has nabbed it and programmed it for a reasonable time into the future. ~Andrew> >> We cannot spin in this function, as we have interrupts disabled. > And I was suggesting this only if the failure condition would provide > a positive sign of there some suitable resource having become > available. > > Jan
Jan Beulich
2013-Nov-25 07:50 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 22/11/13 15:45, Jan Beulich wrote: >>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> The new logic is as follows: >>> * A single high priority vector is allocated and uses on all cpus. >> Does this really need to be a high priority one? I''d think we''d be >> fine with the lowest priority one we can get, as we only need the >> wakeup here if nothing else gets a CPU to wake up. > > Yes - absolutely. We cannot have an HPET interrupt lower priority than > a guest line level interrupt. > > Another cpu could be registered with our HPET channel to be worken up, > and we need to service them in a timely fashon.Which I meanwhile think hints at an issue with the (re)design: These wakeups, from an abstract pov, shouldn''t be high priority interrupts - they''re meant to wake a CPU only when nothing else would wake them in time. And this could be accomplished by transferring ownership of the channel during wakeup from the waking CPU to the next one to wake. WHich at once would eliminate the bogus logic selecting a channel for a CPU to re-use when no free one is available: It then wouldn''t really matter which one gets re-used (i.e. could be assigned in e.g. a round robin fashion). The fundamental requirement would be to run the wakeup (in particular channel re-assignment) logic not just from the HPET interrupt, but inside an exit_idle() construct called from all IRQ paths (similar to how Linux does this). Jan
Jan Beulich
2013-Nov-25 07:52 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> On 22.11.13 at 18:38, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 22/11/13 16:49, Jan Beulich wrote: >>>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> On 22/11/13 15:45, Jan Beulich wrote: >>>>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>> +/* Wake up all cpus in the channel mask. Lock should be held. */ >>>>> +static void hpet_wake_cpus(struct hpet_event_channel *ch) >>>>> { >>>>> - unsigned int cpu = smp_processor_id(); >>>>> + cpuidle_wakeup_mwait(ch->cpumask); >>>>> >>>>> - if ( cpumask_test_and_clear_cpu(cpu, mask) ) >>>>> - raise_softirq(TIMER_SOFTIRQ); >>>>> - >>>>> - cpuidle_wakeup_mwait(mask); >>>>> - >>>>> - if ( !cpumask_empty(mask) ) >>>>> - cpumask_raise_softirq(mask, TIMER_SOFTIRQ); >>>>> + if ( !cpumask_empty(ch->cpumask) ) >>>>> + cpumask_raise_softirq(ch->cpumask, TIMER_SOFTIRQ); >>>> Looks like the cpumask_empty() check isn''t really needed here? >>> cpuidle_wakeup_mwait(mask) will only wake cpus who have set their bit in >>> cpuidle_mwait_flags. >>> >>> There might be cpus who have registered for waking up who have not yet >>> set their bit in cpuidle_mwait_flags. >>> >>> Out of caution, I did by best to avoid playing with the guts of the >>> timing code, as it seems quite fragile. >> Some misunderstanding? The cpumask_empty() check doesn''t >> guard the call to cpuidle_wakeup_mwait(), but the one to >> cpumask_raise_softirq() (which appears to be prepared to be >> called with an empty mask). > > cpuidle_wakeup_mwait() clears bits out of mask. The result is the set > of cpus who were not set in cpuidle_mwait_flags.Sure - but there''s still no big gain from doing the cpumask_empty() check here. Unless I''m missing something... Jan
Andrew Cooper
2013-Nov-26 18:32 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 25/11/13 07:50, Jan Beulich wrote:>>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 22/11/13 15:45, Jan Beulich wrote: >>>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> The new logic is as follows: >>>> * A single high priority vector is allocated and uses on all cpus. >>> Does this really need to be a high priority one? I''d think we''d be >>> fine with the lowest priority one we can get, as we only need the >>> wakeup here if nothing else gets a CPU to wake up. >> Yes - absolutely. We cannot have an HPET interrupt lower priority than >> a guest line level interrupt. >> >> Another cpu could be registered with our HPET channel to be worken up, >> and we need to service them in a timely fashon. > Which I meanwhile think hints at an issue with the (re)design: > These wakeups, from an abstract pov, shouldn''t be high > priority interrupts - they''re meant to wake a CPU only when > nothing else would wake them in time. And this could be > accomplished by transferring ownership of the channel during > wakeup from the waking CPU to the next one to wake. > > WHich at once would eliminate the bogus logic selecting a channel > for a CPU to re-use when no free one is available: It then wouldn''t > really matter which one gets re-used (i.e. could be assigned in e.g. > a round robin fashion). > > The fundamental requirement would be to run the wakeup (in > particular channel re-assignment) logic not just from the HPET > interrupt, but inside an exit_idle() construct called from all IRQ > paths (similar to how Linux does this). > > Jan >Irrespective of the problem of ownership, the HPET interrupt still needs to be high priority. Consider the following scenario: * A line level interrupt is received on pcpu 0. It is left outstanding at the LAPIC. * A domain is scheduled on pcpu 0, and has has an event injected for the line level interrupt. * The event handler takes a long time, and during the process, the domains vcpu is rescheduled elsewhere * pcpu0 is now completely idle and goes to sleep. This scenario has pcpu 0 going to sleep with an outstanding line level irq unacked at the LAPIC, with a low priority HPET interrupt blocked until the domain has signalled the completion of the event. There is no safe scenario (given Xen''s handling of line level interrupt) for timer interrupts to be lower priority than the highest possible line level priority. ~Andrew
Jan Beulich
2013-Nov-27 08:35 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> On 26.11.13 at 19:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 25/11/13 07:50, Jan Beulich wrote: >>>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> On 22/11/13 15:45, Jan Beulich wrote: >>>>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>> The new logic is as follows: >>>>> * A single high priority vector is allocated and uses on all cpus. >>>> Does this really need to be a high priority one? I''d think we''d be >>>> fine with the lowest priority one we can get, as we only need the >>>> wakeup here if nothing else gets a CPU to wake up. >>> Yes - absolutely. We cannot have an HPET interrupt lower priority than >>> a guest line level interrupt. >>> >>> Another cpu could be registered with our HPET channel to be worken up, >>> and we need to service them in a timely fashon. >> Which I meanwhile think hints at an issue with the (re)design: >> These wakeups, from an abstract pov, shouldn''t be high >> priority interrupts - they''re meant to wake a CPU only when >> nothing else would wake them in time. And this could be >> accomplished by transferring ownership of the channel during >> wakeup from the waking CPU to the next one to wake. >> >> WHich at once would eliminate the bogus logic selecting a channel >> for a CPU to re-use when no free one is available: It then wouldn''t >> really matter which one gets re-used (i.e. could be assigned in e.g. >> a round robin fashion). >> >> The fundamental requirement would be to run the wakeup (in >> particular channel re-assignment) logic not just from the HPET >> interrupt, but inside an exit_idle() construct called from all IRQ >> paths (similar to how Linux does this). >> >> Jan >> > > Irrespective of the problem of ownership, the HPET interrupt still needs > to be high priority. Consider the following scenario: > > * A line level interrupt is received on pcpu 0. It is left outstanding > at the LAPIC. > * A domain is scheduled on pcpu 0, and has has an event injected for the > line level interrupt. > * The event handler takes a long time, and during the process, the > domains vcpu is rescheduled elsewhere > * pcpu0 is now completely idle and goes to sleep. > > This scenario has pcpu 0 going to sleep with an outstanding line level > irq unacked at the LAPIC, with a low priority HPET interrupt blocked > until the domain has signalled the completion of the event. > > There is no safe scenario (given Xen''s handling of line level interrupt) > for timer interrupts to be lower priority than the highest possible line > level priority.That''s true for the "new ack" model, but not the "old" one (which is in particular also being used when using directed EOI). Which may in turn be an argument to consider whether the vector selection (low or high priority) shouldn''t depend on the IO-APIC ack model. Jan
Andrew Cooper
2013-Nov-27 22:37 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 27/11/2013 08:35, Jan Beulich wrote:>>>> On 26.11.13 at 19:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 25/11/13 07:50, Jan Beulich wrote: >>>>>> On 22.11.13 at 17:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> On 22/11/13 15:45, Jan Beulich wrote: >>>>>>>> On 14.11.13 at 17:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>>>> The new logic is as follows: >>>>>> * A single high priority vector is allocated and uses on all cpus. >>>>> Does this really need to be a high priority one? I''d think we''d be >>>>> fine with the lowest priority one we can get, as we only need the >>>>> wakeup here if nothing else gets a CPU to wake up. >>>> Yes - absolutely. We cannot have an HPET interrupt lower priority than >>>> a guest line level interrupt. >>>> >>>> Another cpu could be registered with our HPET channel to be worken up, >>>> and we need to service them in a timely fashon. >>> Which I meanwhile think hints at an issue with the (re)design: >>> These wakeups, from an abstract pov, shouldn''t be high >>> priority interrupts - they''re meant to wake a CPU only when >>> nothing else would wake them in time. And this could be >>> accomplished by transferring ownership of the channel during >>> wakeup from the waking CPU to the next one to wake. >>> >>> WHich at once would eliminate the bogus logic selecting a channel >>> for a CPU to re-use when no free one is available: It then wouldn''t >>> really matter which one gets re-used (i.e. could be assigned in e.g. >>> a round robin fashion). >>> >>> The fundamental requirement would be to run the wakeup (in >>> particular channel re-assignment) logic not just from the HPET >>> interrupt, but inside an exit_idle() construct called from all IRQ >>> paths (similar to how Linux does this). >>> >>> Jan >>> >> Irrespective of the problem of ownership, the HPET interrupt still needs >> to be high priority. Consider the following scenario: >> >> * A line level interrupt is received on pcpu 0. It is left outstanding >> at the LAPIC. >> * A domain is scheduled on pcpu 0, and has has an event injected for the >> line level interrupt. >> * The event handler takes a long time, and during the process, the >> domains vcpu is rescheduled elsewhere >> * pcpu0 is now completely idle and goes to sleep. >> >> This scenario has pcpu 0 going to sleep with an outstanding line level >> irq unacked at the LAPIC, with a low priority HPET interrupt blocked >> until the domain has signalled the completion of the event. >> >> There is no safe scenario (given Xen''s handling of line level interrupt) >> for timer interrupts to be lower priority than the highest possible line >> level priority. > That''s true for the "new ack" model, but not the "old" one (which is > in particular also being used when using directed EOI). Which may > in turn be an argument to consider whether the vector selection > (low or high priority) shouldn''t depend on the IO-APIC ack model. > > Jan >How would you go about allocating vectors then? All IRQs are allocated randomly between vector 0x21 and 0xdf. You can certainly preferentially prefer lower vectors for line level interrupts, but the only way to guarantee that the HPET vector is greater than line level interrupts is to have it higher than 0xdf. I do think that allocating line level vectors lower is a good idea, given how long they remain outstanding. I also think that a useful performance tweak would be for device driver domains to be able to request a preferentially higher vector for their interrupts. From a pragmatic point of view, there are plenty of spare high priority vectors for use, where as we at XenServer already have usecases where we are running out of free vectors in the range 0x21 -> 0xdf due to shear quantities of SRIOV. I already have half a mind to see whether I can optimise the current allocation of vectors to make the dynamic range larger. ~Andrew
Jan Beulich
2013-Nov-28 14:33 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
>>> Andrew Cooper <andrew.cooper3@citrix.com> 11/27/13 11:37 PM >>> >On 27/11/2013 08:35, Jan Beulich wrote: >>>>> On 26.11.13 at 19:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>> There is no safe scenario (given Xen''s handling of line level interrupt) >>> for timer interrupts to be lower priority than the highest possible line >>> level priority. >> That''s true for the "new ack" model, but not the "old" one (which is >> in particular also being used when using directed EOI). Which may >> in turn be an argument to consider whether the vector selection >> (low or high priority) shouldn''t depend on the IO-APIC ack model. > >How would you go about allocating vectors then? All IRQs are allocated >randomly between vector 0x21 and 0xdf. You can certainly preferentially >prefer lower vectors for line level interrupts, but the only way to >guarantee that the HPET vector is greater than line level interrupts is >to have it higher than 0xdf.I''m not proposing any change to vector allocation. Once again, with the new ack model I agree that the HPET one must be high priority. With the old ack model, however, it could be low priority (and we might consider putting it at 0x20, preventing it from being used in dynamic allocation, or even at 0x1f - that ought to work as long as there''s no CPU exception at that vector, since only the range 0x00...0x0f is reserved in the APIC architecture).>I do think that allocating line level vectors lower is a good idea, >given how long they remain outstanding. I also think that a useful >performance tweak would be for device driver domains to be able to >request a preferentially higher vector for their interrupts.I''m not sure about this one.>From a pragmatic point of view, there are plenty of spare high priority >vectors for use, where as we at XenServer already have usecases where we >are running out of free vectors in the range 0x21 -> 0xdf due to shear >quantities of SRIOV. I already have half a mind to see whether I can >optimise the current allocation of vectors to make the dynamic range larger.Growing that range will only be possible by a small amount anyway, so this won''t buy you much (you''ll run out of vectors very quickly again). But then again - with the vector ranges being available once per CPU, are there really setups where the requirements exceed the vectors * CPUs value? Jan
Andrew Cooper
2013-Nov-28 15:06 UTC
Re: [Patch v5 2/5] x86/hpet: Use singe apic vector rather than irq_descs for HPET interrupts
On 28/11/13 14:33, Jan Beulich wrote:>>>> Andrew Cooper <andrew.cooper3@citrix.com> 11/27/13 11:37 PM >>> >> On 27/11/2013 08:35, Jan Beulich wrote: >>>>>> On 26.11.13 at 19:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >>>> There is no safe scenario (given Xen''s handling of line level interrupt) >>>> for timer interrupts to be lower priority than the highest possible line >>>> level priority. >>> That''s true for the "new ack" model, but not the "old" one (which is >>> in particular also being used when using directed EOI). Which may >>> in turn be an argument to consider whether the vector selection >>> (low or high priority) shouldn''t depend on the IO-APIC ack model. >> How would you go about allocating vectors then? All IRQs are allocated >> randomly between vector 0x21 and 0xdf. You can certainly preferentially >> prefer lower vectors for line level interrupts, but the only way to >> guarantee that the HPET vector is greater than line level interrupts is >> to have it higher than 0xdf. > I''m not proposing any change to vector allocation. Once again, with the > new ack model I agree that the HPET one must be high priority. With the > old ack model, however, it could be low priority (and we might consider > putting it at 0x20, preventing it from being used in dynamic allocation, or > even at 0x1f - that ought to work as long as there''s no CPU exception at > that vector, since only the range 0x00...0x0f is reserved in the APIC > architecture).Because of XSA-3, the TPR settings block any external vectors below 0x20, to protect from external devices trying to trigger exceptions (Alignment check specifically which expects to have an error code on the stack) 0x20 is currently the dynamic irq cleanup vector, but as it is only used with "int 0x20", it could be moved into the reserved region. (Not that I suggest we do use the reserved region)> >> I do think that allocating line level vectors lower is a good idea, >> given how long they remain outstanding. I also think that a useful >> performance tweak would be for device driver domains to be able to >> request a preferentially higher vector for their interrupts. > I''m not sure about this one. > > >From a pragmatic point of view, there are plenty of spare high priority >> vectors for use, where as we at XenServer already have usecases where we >> are running out of free vectors in the range 0x21 -> 0xdf due to shear >> quantities of SRIOV. I already have half a mind to see whether I can >> optimise the current allocation of vectors to make the dynamic range larger. > Growing that range will only be possible by a small amount anyway, so this > won''t buy you much (you''ll run out of vectors very quickly again). But then > again - with the vector ranges being available once per CPU, are there > really setups where the requirements exceed the vectors * CPUs value? > > JanNetscalar MPX systems have (off the top of my head) 24x 10GB network cards with 64 functions and 3 interrupts per function 4x SSL offload cards with 64 functions and 2 interrupts per function On a dual Sandy-bridge server with 32 cores in total. That comes to 160 of the 189 available entries in the dynamic region used before the consideration of other dom0 interrupts, and the fact that irq migration logic needs to reserve a vector in the target IDT before starting the move. Even a fractional increase in available space will have a large impact on the contention. ~Andrew> > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel