George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 00 of 16] credit2: Scale to multiple sockets
This patch series allows credit2 to scale reasonably past one socket. The two key things it introduces are:

* Code to detect cpu topology, and create one runqueue per socket
* A first cut at a load-average based load balancer

Quick descriptions below:

01: Make some debug messages quieter.
02-03: Clean up some of the per-cpu runqueue lock pointer code. This is the only part of the series that touches code outside the credit2 scheduler.
04-07: Infrastructure for having multiple runqueues, including hooks for load calculation
08: Detect socket layout, make one runqueue per socket
09-13: Infrastructure for load-balancing, including load averages for runqueues and vcpus
14: A load-average-based balancer
15: Refinement on when to balance
16: Debug key output

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 01 of 16] credit2: Quieten some debug messages
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r f69037cc4674 -r 42291c2cebf1 xen/common/sched_credit2.c
--- a/xen/common/sched_credit2.c	Wed Dec 22 17:48:31 2010 +0000
+++ b/xen/common/sched_credit2.c	Thu Dec 23 12:23:56 2010 +0000
@@ -538,7 +538,7 @@
     if ( new_weight > rqd->max_weight )
     {
         rqd->max_weight = new_weight;
-        printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+        d2printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
     }
     else if ( old_weight == rqd->max_weight )
     {
@@ -554,7 +554,7 @@
         }
 
         rqd->max_weight = max_weight;
-        printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+        d2printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
     }
 }

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 02 of 16] scheduler: Update vcpu_schedule_lock to check for changed lock pointer as well
Credit2 has different cpus share a lock, which means that as cpus are added, and as they're moved between pools, the pointer to the scheduler lock may change as well. Since we don't want to have to grab a lock before grabbing the per-cpu scheduler lock, we use the lock itself to protect against the pointer changing. However, since it may change between reading and locking, after we grab the lock we need to check to make sure it's still the right one.

Update the vcpu_schedule_lock() definition to reflect this: both v->processor and that processor's schedule lock are liable to change; check both after grabbing the lock, and release / re-acquire if necessary.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 42291c2cebf1 -r b4be3457d8bd xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Thu Dec 23 12:23:56 2010 +0000
+++ b/xen/include/xen/sched-if.h	Thu Dec 23 12:24:10 2010 +0000
@@ -41,23 +41,25 @@
 
 static inline void vcpu_schedule_lock(struct vcpu *v)
 {
-    unsigned int cpu;
+    spinlock_t * lock;
 
     for ( ; ; )
     {
-        /* NB: For schedulers with multiple cores per runqueue,
-         * a vcpu may change processor w/o changing runqueues;
-         * so we may release a lock only to grab it again.
+        /* v->processor may change when grabbing the lock; but
+         * per_cpu(v->processor) may also change, if changing
+         * cpu pool also changes the scheduler lock.  Retry
+         * until they match.
          *
-         * If that is measured to be an issue, then the check
-         * should be changed to checking if the locks pointed to
-         * by cpu and v->processor are still the same.
+         * It may also be the case that v->processor may change
+         * but the lock may be the same; this will succeed
+         * in that case.
          */
-        cpu = v->processor;
-        spin_lock(per_cpu(schedule_data, cpu).schedule_lock);
-        if ( likely(v->processor == cpu) )
+        lock=per_cpu(schedule_data, v->processor).schedule_lock;
+
+        spin_lock(lock);
+        if ( likely(lock == per_cpu(schedule_data, v->processor).schedule_lock) )
             break;
-        spin_unlock(per_cpu(schedule_data, cpu).schedule_lock);
+        spin_unlock(lock);
     }
 }

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
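The read-the-pointer / lock / re-check pattern described above is independent of the Xen internals it is applied to here. A minimal standalone sketch of the same idea, with a pthread mutex standing in for the Xen spinlock; struct sched_pcpu, pcpu_lock() and pcpu_unlock() are names invented for illustration, not Xen code:

#include <pthread.h>
#include <stdio.h>

/* Illustrative stand-in for per_cpu(schedule_data, cpu).schedule_lock. */
struct sched_pcpu {
    pthread_mutex_t *schedule_lock;   /* may be re-pointed while we wait on it */
};

static void pcpu_lock(struct sched_pcpu *pc)
{
    for ( ; ; )
    {
        /* Snapshot the pointer and take the lock it points to... */
        pthread_mutex_t *lock = pc->schedule_lock;
        pthread_mutex_lock(lock);
        /* ...then confirm nobody re-pointed schedule_lock in the meantime. */
        if ( lock == pc->schedule_lock )
            return;                   /* still the right lock: done */
        pthread_mutex_unlock(lock);   /* stale lock: drop it and retry */
    }
}

static void pcpu_unlock(struct sched_pcpu *pc)
{
    /* Safe: the pointer cannot change while the current lock is held. */
    pthread_mutex_unlock(pc->schedule_lock);
}

int main(void)
{
    pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    struct sched_pcpu pc = { .schedule_lock = &a };

    pcpu_lock(&pc);
    puts("locked via pointer re-check");
    pcpu_unlock(&pc);
    return 0;
}

The same reasoning underlies the pcpu_schedule_lock()/pcpu_schedule_trylock() helpers introduced in the next patch.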
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 03 of 16] scheduler: Introduce pcpu_schedule_lock
Many places in Xen, particularly schedule.c, grab the per-cpu spinlock directly, rather than through vcpu_schedule_lock(). Since the lock pointer may change between the time it''s read and the time the lock is successfully acquired, we need to check after acquiring the lock to make sure that the pcpu''s lock hasn''t changed, due to cpu initialization or cpupool activity. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r b4be3457d8bd -r 666ca1dd3ca7 xen/arch/ia64/vmx/vmmu.c --- a/xen/arch/ia64/vmx/vmmu.c Thu Dec 23 12:24:10 2010 +0000 +++ b/xen/arch/ia64/vmx/vmmu.c Thu Dec 23 12:24:29 2010 +0000 @@ -394,7 +394,7 @@ if (cpu != current->processor) return; local_irq_save(flags); - if (!spin_trylock(per_cpu(schedule_data, cpu).schedule_lock)) + if (!pcpu_schedule_trylock(cpu)) goto bail2; if (v->processor != cpu) goto bail1; @@ -416,7 +416,7 @@ ia64_dv_serialize_data(); args->vcpu = NULL; bail1: - spin_unlock(per_cpu(schedule_data, cpu).schedule_lock); + pcpu_schedule_unlock(cpu); bail2: local_irq_restore(flags); } diff -r b4be3457d8bd -r 666ca1dd3ca7 xen/common/sched_credit.c --- a/xen/common/sched_credit.c Thu Dec 23 12:24:10 2010 +0000 +++ b/xen/common/sched_credit.c Thu Dec 23 12:24:29 2010 +0000 @@ -905,7 +905,7 @@ spc->runq_sort_last = sort_epoch; - spin_lock_irqsave(per_cpu(schedule_data, cpu).schedule_lock, flags); + pcpu_schedule_lock_irqsave(cpu, flags); runq = &spc->runq; elem = runq->next; @@ -930,7 +930,7 @@ elem = next; } - spin_unlock_irqrestore(per_cpu(schedule_data, cpu).schedule_lock, flags); + pcpu_schedule_unlock_irqrestore(cpu, flags); } static void @@ -1259,7 +1259,7 @@ * cause a deadlock if the peer CPU is also load balancing and trying * to lock this CPU. */ - if ( !spin_trylock(per_cpu(schedule_data, peer_cpu).schedule_lock) ) + if ( !pcpu_schedule_trylock(peer_cpu) ) { CSCHED_STAT_CRANK(steal_trylock_failed); continue; @@ -1269,7 +1269,7 @@ * Any work over there to steal? */ speer = csched_runq_steal(peer_cpu, cpu, snext->pri); - spin_unlock(per_cpu(schedule_data, peer_cpu).schedule_lock); + pcpu_schedule_unlock(peer_cpu); if ( speer != NULL ) { *stolen = 1; diff -r b4be3457d8bd -r 666ca1dd3ca7 xen/common/schedule.c --- a/xen/common/schedule.c Thu Dec 23 12:24:10 2010 +0000 +++ b/xen/common/schedule.c Thu Dec 23 12:24:29 2010 +0000 @@ -424,7 +424,8 @@ atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count); } - /* Switch to new CPU, then unlock old CPU. */ + /* Switch to new CPU, then unlock old CPU. This is safe because + * the lock pointer cant'' change while the current lock is held. 
*/ v->processor = new_cpu; spin_unlock_irqrestore( per_cpu(schedule_data, old_cpu).schedule_lock, flags); @@ -1302,7 +1303,7 @@ ppriv = SCHED_OP(new_ops, alloc_pdata, cpu); vpriv = SCHED_OP(new_ops, alloc_vdata, idle, idle->domain->sched_priv); - spin_lock_irqsave(per_cpu(schedule_data, cpu).schedule_lock, flags); + pcpu_schedule_lock_irqsave(cpu, flags); SCHED_OP(old_ops, tick_suspend, cpu); vpriv_old = idle->sched_priv; @@ -1313,7 +1314,7 @@ SCHED_OP(new_ops, tick_resume, cpu); SCHED_OP(new_ops, insert_vcpu, idle); - spin_unlock_irqrestore(per_cpu(schedule_data, cpu).schedule_lock, flags); + pcpu_schedule_unlock_irqrestore(cpu, flags); SCHED_OP(old_ops, free_vdata, vpriv_old); SCHED_OP(old_ops, free_pdata, ppriv_old, cpu); @@ -1369,10 +1370,10 @@ for_each_cpu_mask (i, *cpus) { - spin_lock(per_cpu(schedule_data, i).schedule_lock); + pcpu_schedule_lock(i); printk("CPU[%02d] ", i); SCHED_OP(sched, dump_cpu_state, i); - spin_unlock(per_cpu(schedule_data, i).schedule_lock); + pcpu_schedule_unlock(i); } } diff -r b4be3457d8bd -r 666ca1dd3ca7 xen/include/xen/sched-if.h --- a/xen/include/xen/sched-if.h Thu Dec 23 12:24:10 2010 +0000 +++ b/xen/include/xen/sched-if.h Thu Dec 23 12:24:29 2010 +0000 @@ -39,6 +39,57 @@ DECLARE_PER_CPU(struct scheduler *, scheduler); DECLARE_PER_CPU(struct cpupool *, cpupool); +static inline spinlock_t * pcpu_schedule_lock(int cpu) +{ + spinlock_t * lock=NULL; + + for ( ; ; ) + { + /* The per_cpu(v->processor) may also change, if changing + * cpu pool also changes the scheduler lock. Retry + * until they match. + */ + lock=per_cpu(schedule_data, cpu).schedule_lock; + + spin_lock(lock); + if ( likely(lock == per_cpu(schedule_data, cpu).schedule_lock) ) + break; + spin_unlock(lock); + } + return lock; +} + +static inline int pcpu_schedule_trylock(int cpu) +{ + spinlock_t * lock=NULL; + + lock=per_cpu(schedule_data, cpu).schedule_lock; + if ( ! spin_trylock(lock) ) + return 0; + if ( lock == per_cpu(schedule_data, cpu).schedule_lock ) + return 1; + else + { + spin_unlock(lock); + return 0; + } +} + +#define pcpu_schedule_lock_irq(p) \ + do { local_irq_disable(); pcpu_schedule_lock(p); } while ( 0 ) +#define pcpu_schedule_lock_irqsave(p, flags) \ + do { local_irq_save(flags); pcpu_schedule_lock(p); } while ( 0 ) + +static inline void pcpu_schedule_unlock(int cpu) +{ + spin_unlock(per_cpu(schedule_data, cpu).schedule_lock); +} + +#define pcpu_schedule_unlock_irq(p) \ + do { pcpu_schedule_unlock(p); local_irq_enable(); } while ( 0 ) +#define pcpu_schedule_unlock_irqrestore(p, flags) \ + do { pcpu_schedule_unlock(p); local_irq_restore(flags); } while ( 0 ) + static inline void vcpu_schedule_lock(struct vcpu *v) { spinlock_t * lock; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 04 of 16] credit2: Refactor runqueue initialization
Several refactorizations: * Add prv->initialized cpu mask * Replace prv->runq_count with active_queue mask * Replace rqd->cpu_min,cpu_mask with active cpu mask * Put locks in the runqueue structure, rather than borrowing the existing cpu locks * init() initializes all runqueues to NULL, inactive, and maps all pcpus to runqueue -q * alloc_pcpu() will add cpus to runqueues, "activating" the runqueue if necessary. All cpus are currently assigned to runqueue 0. End-to-end behavior of the system should remain largely the same. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 666ca1dd3ca7 -r b4aeb2f27741 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:24:29 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:24:48 2010 +0000 @@ -176,10 +176,14 @@ */ struct csched_runqueue_data { int id; + + spinlock_t lock; /* Lock for this runqueue. */ + cpumask_t active; /* CPUs enabled for this runqueue */ + struct list_head runq; /* Ordered list of runnable vms */ struct list_head svc; /* List of all vcpus assigned to this runqueue */ int max_weight; - int cpu_min, cpu_max; /* Range of physical cpus this runqueue runs */ + cpumask_t idle, /* Currently idle */ tickled; /* Another cpu in the queue is already targeted for this one */ }; @@ -189,12 +193,12 @@ */ struct csched_private { spinlock_t lock; - uint32_t ncpus; - + cpumask_t initialized; /* CPU is initialized for this pool */ + struct list_head sdom; /* Used mostly for dump keyhandler. */ int runq_map[NR_CPUS]; - uint32_t runq_count; + cpumask_t active_queues; /* Queues which may have active cpus */ struct csched_runqueue_data rqd[NR_CPUS]; }; @@ -341,7 +345,7 @@ int i, ipid=-1; s_time_t lowest=(1<<30); struct csched_runqueue_data *rqd = RQD(ops, cpu); - cpumask_t *online, mask; + cpumask_t mask; struct csched_vcpu * cur; d2printk("rqt d%dv%d cd%dv%d\n", @@ -374,9 +378,7 @@ /* Otherwise, look for the non-idle cpu with the lowest credit, * skipping cpus which have been tickled but not scheduled yet */ - online = CSCHED_CPUONLINE(per_cpu(cpupool, cpu)); - - cpus_andnot(mask, *online, rqd->idle); + cpus_andnot(mask, rqd->active, rqd->idle); cpus_andnot(mask, mask, rqd->tickled); for_each_cpu_mask(i, mask) @@ -997,7 +999,7 @@ const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled) { const int cpu = smp_processor_id(); - struct csched_runqueue_data *rqd = RQD(ops, cpu); + struct csched_runqueue_data *rqd; struct csched_vcpu * const scurr = CSCHED_VCPU(current); struct csched_vcpu *snext = NULL; struct task_slice ret; @@ -1010,6 +1012,10 @@ scurr->vcpu->vcpu_id, now); + BUG_ON(!cpu_isset(cpu, CSCHED_PRIV(ops)->initialized)); + + rqd = RQD(ops, cpu); + BUG_ON(!cpu_isset(cpu, rqd->active)); /* Protected by runqueue lock */ @@ -1166,14 +1172,22 @@ { struct list_head *iter_sdom, *iter_svc; struct csched_private *prv = CSCHED_PRIV(ops); - int loop; + int i, loop; - printk("info:\n" - "\tncpus = %u\n" + printk("Active queues: %d\n" "\tdefault-weight = %d\n", - prv->ncpus, + cpus_weight(prv->active_queues), CSCHED_DEFAULT_WEIGHT); + for_each_cpu_mask(i, prv->active_queues) + { + printk("Runqueue %d:\n" + "\tncpus = %u\n" + "\tmax_weight = %d\n", + i, + cpus_weight(prv->rqd[i].active), + prv->rqd[i].max_weight); + } /* FIXME: Locking! 
*/ printk("Domain info:\n"); @@ -1199,16 +1213,82 @@ } } -static void -csched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu) +static void activate_runqueue(struct csched_private *prv, int rqi) { - unsigned long flags; + struct csched_runqueue_data *rqd; + + rqd = prv->rqd + rqi; + + BUG_ON(!cpus_empty(rqd->active)); + + rqd->max_weight = 1; + rqd->id = rqi; + INIT_LIST_HEAD(&rqd->svc); + INIT_LIST_HEAD(&rqd->runq); + spin_lock_init(&rqd->lock); + + cpu_set(rqi, prv->active_queues); +} + +static void deactivate_runqueue(struct csched_private *prv, int rqi) +{ + struct csched_runqueue_data *rqd; + + rqd = prv->rqd + rqi; + + BUG_ON(!cpus_empty(rqd->active)); + + rqd->id = -1; + + cpu_clear(rqi, prv->active_queues); +} + +static void init_pcpu(const struct scheduler *ops, int cpu) +{ + int rqi, old_rqi, flags; struct csched_private *prv = CSCHED_PRIV(ops); + struct csched_runqueue_data *rqd; + spinlock_t *old_lock; spin_lock_irqsave(&prv->lock, flags); - prv->ncpus--; - cpu_clear(cpu, RQD(ops, cpu)->idle); - printk("Removing cpu %d to pool (%d total)\n", cpu, prv->ncpus); + + if ( cpu_isset(cpu, prv->initialized) ) + { + printk("%s: Strange, cpu %d already initialized!\n", __func__, cpu); + spin_unlock_irqrestore(&prv->lock, flags); + return; + } + + old_rqi = prv->runq_map[cpu]; + + /* Figure out which runqueue to put it in */ + rqi = 0; + + rqd=prv->rqd + rqi; + + printk("Adding cpu %d to runqueue %d\n", cpu, rqi); + if ( ! cpu_isset(rqi, prv->active_queues) ) + { + printk(" First cpu on runqueue, activating\n"); + activate_runqueue(prv, rqi); + } + + /* IRQs already disabled */ + old_lock=pcpu_schedule_lock(cpu); + + /* Move spinlock to new runq lock. */ + per_cpu(schedule_data, cpu).schedule_lock = &rqd->lock; + + /* Set the runqueue map */ + prv->runq_map[cpu]=rqi; + + cpu_set(cpu, rqd->idle); + cpu_set(cpu, rqd->active); + + spin_unlock(old_lock); + + cpu_set(cpu, prv->initialized); + spin_unlock_irqrestore(&prv->lock, flags); return; @@ -1217,33 +1297,51 @@ static void * csched_alloc_pdata(const struct scheduler *ops, int cpu) { - spinlock_t *new_lock; - spinlock_t *old_lock = per_cpu(schedule_data, cpu).schedule_lock; - unsigned long flags; - struct csched_private *prv = CSCHED_PRIV(ops); - - spin_lock_irqsave(old_lock, flags); - new_lock = &per_cpu(schedule_data, prv->runq_map[cpu])._lock; - per_cpu(schedule_data, cpu).schedule_lock = new_lock; - spin_unlock_irqrestore(old_lock, flags); - - spin_lock_irqsave(&prv->lock, flags); - prv->ncpus++; - cpu_set(cpu, RQD(ops, cpu)->idle); - printk("Adding cpu %d to pool (%d total)\n", cpu, prv->ncpus); - spin_unlock_irqrestore(&prv->lock, flags); + init_pcpu(ops, cpu); return (void *)1; } static void -make_runq_map(struct csched_private *prv) +csched_free_pdata(const struct scheduler *ops, void *pcpu, int cpu) { - /* FIXME: Read pcpu layout and do this properly */ - prv->runq_count = 1; - prv->rqd[0].cpu_min = 0; - prv->rqd[0].cpu_max = NR_CPUS; - memset(prv->runq_map, 0, sizeof(prv->runq_map)); + unsigned long flags; + struct csched_private *prv = CSCHED_PRIV(ops); + struct csched_runqueue_data *rqd; + int rqi; + + spin_lock_irqsave(&prv->lock, flags); + + BUG_ON( !cpu_isset(cpu, prv->initialized)); + + /* Find the old runqueue and remove this cpu from it */ + rqi = prv->runq_map[cpu]; + + rqd = prv->rqd + rqi; + + /* No need to save IRQs here, they''re already disabled */ + spin_lock(&rqd->lock); + + BUG_ON(!cpu_isset(cpu, rqd->idle)); + + printk("Removing cpu %d from runqueue %d\n", cpu, rqi); + + cpu_clear(cpu, rqd->idle); + 
cpu_clear(cpu, rqd->active); + + if ( cpus_empty(rqd->active) ) + { + printk(" No cpus left on runqueue, disabling\n"); + deactivate_runqueue(prv, rqi); + } + + spin_unlock(&rqd->lock); + + cpu_clear(cpu, prv->initialized); + + spin_unlock_irqrestore(&prv->lock, flags); + + return; } static int @@ -1265,18 +1363,11 @@ spin_lock_init(&prv->lock); INIT_LIST_HEAD(&prv->sdom); - prv->ncpus = 0; - - make_runq_map(prv); - - for ( i=0; i<prv->runq_count ; i++ ) + /* But un-initialize all runqueues */ + for ( i=0; i<NR_CPUS; i++) { - struct csched_runqueue_data *rqd = prv->rqd + i; - - rqd->max_weight = 1; - rqd->id = i; - INIT_LIST_HEAD(&rqd->svc); - INIT_LIST_HEAD(&rqd->runq); + prv->runq_map[i] = -1; + prv->rqd[i].id = -1; } return 0; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
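The activation scheme this patch sets up — every runqueue starts out inactive (id == -1), the first cpu assigned to a runqueue activates it, and the last cpu removed deactivates it again — can be modelled outside Xen. A small self-contained sketch, with plain uint64_t bitmasks standing in for cpumask_t; struct sched, cpu_add() and cpu_remove() are invented for illustration:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NR 64   /* number of cpus and of possible runqueues in this toy model */

struct runq { int id; uint64_t active; /* one bit per cpu on this runqueue */ };

struct sched {
    uint64_t active_queues;   /* one bit per runqueue that has >= 1 cpu */
    int runq_map[NR];         /* cpu -> runqueue index, -1 if unassigned */
    struct runq rqd[NR];
};

static void cpu_add(struct sched *s, int cpu, int rqi)
{
    struct runq *rqd = &s->rqd[rqi];

    if ( !(s->active_queues & (1ULL << rqi)) )   /* first cpu: activate */
    {
        assert(rqd->active == 0);
        rqd->id = rqi;
        s->active_queues |= 1ULL << rqi;
    }
    rqd->active |= 1ULL << cpu;
    s->runq_map[cpu] = rqi;
}

static void cpu_remove(struct sched *s, int cpu)
{
    int rqi = s->runq_map[cpu];
    struct runq *rqd = &s->rqd[rqi];

    rqd->active &= ~(1ULL << cpu);
    s->runq_map[cpu] = -1;
    if ( rqd->active == 0 )                      /* last cpu: deactivate */
    {
        rqd->id = -1;
        s->active_queues &= ~(1ULL << rqi);
    }
}

int main(void)
{
    struct sched s;
    s.active_queues = 0;
    for ( int i = 0; i < NR; i++ )
    {
        s.runq_map[i] = -1;
        s.rqd[i] = (struct runq){ .id = -1 };
    }

    cpu_add(&s, 0, 0);
    cpu_add(&s, 1, 0);
    printf("after adding cpus 0,1: active_queues=%#llx rq0.id=%d\n",
           (unsigned long long)s.active_queues, s.rqd[0].id);

    cpu_remove(&s, 0);
    cpu_remove(&s, 1);
    printf("after removing both:   active_queues=%#llx rq0.id=%d\n",
           (unsigned long long)s.active_queues, s.rqd[0].id);
    return 0;
}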
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 05 of 16] credit2: Handle runqueue changes
In preparation for cross-runqueue migration, make changes to make that more robust. Changes include: * An up-pointer from the svc struct to the runqueue it''s assigned to * Explicit runqueue assign/desassings, with appropriate ASSERTs * cpu_pick will de-assign a vcpu from a runqueue if it''s migrating, and wake will re-assign it Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r b4aeb2f27741 -r 91cdbd71585b xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:24:48 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:25:02 2010 +0000 @@ -42,6 +42,7 @@ #define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS + 6 #define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS + 7 #define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS + 8 +#define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS + 10 /* * WARNING: This is still in an experimental phase. Status and work can be found at the @@ -209,6 +210,7 @@ struct list_head rqd_elem; /* On the runqueue data list */ struct list_head sdom_elem; /* On the domain vcpu list */ struct list_head runq_elem; /* On the runqueue */ + struct csched_runqueue_data *rqd; /* Up-pointer to the runqueue */ /* Up-pointers */ struct csched_dom *sdom; @@ -274,6 +276,7 @@ svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id); + BUG_ON(&svc->rqd->runq != runq); /* Idle vcpus not allowed on the runqueue anymore */ BUG_ON(is_idle_vcpu(svc->vcpu)); BUG_ON(svc->vcpu->is_running); @@ -355,6 +358,7 @@ current->vcpu_id); BUG_ON(new->vcpu->processor != cpu); + BUG_ON(new->rqd != rqd); /* Look at the cpu it''s running on first */ cur = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr); @@ -445,15 +449,17 @@ */ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now) { + struct csched_runqueue_data *rqd = RQD(ops, cpu); struct list_head *iter; - list_for_each( iter, &RQD(ops, cpu)->svc ) + list_for_each( iter, &rqd->svc ) { struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, rqd_elem); int start_credit; BUG_ON( is_idle_vcpu(svc->vcpu) ); + BUG_ON( svc->rqd != rqd ); start_credit = svc->credit; @@ -620,12 +626,69 @@ return svc; } +/* Add and remove from runqueue assignment (not active run queue) */ +static void +__runq_assign(struct csched_vcpu *svc, struct csched_runqueue_data *rqd) +{ + + svc->rqd = rqd; + list_add_tail(&svc->rqd_elem, &svc->rqd->svc); + + update_max_weight(svc->rqd, svc->weight, 0); + + /* TRACE */ + { + struct { + unsigned dom:16,vcpu:16; + unsigned rqi:16; + } d; + d.dom = svc->vcpu->domain->domain_id; + d.vcpu = svc->vcpu->vcpu_id; + d.rqi=rqd->id; + trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1, + sizeof(d), + (unsigned char *)&d); + } + +} + +static void +runq_assign(const struct scheduler *ops, struct vcpu *vc) +{ + struct csched_vcpu *svc = vc->sched_priv; + + BUG_ON(svc->rqd != NULL); + + __runq_assign(svc, RQD(ops, vc->processor)); +} + +static void +__runq_deassign(struct csched_vcpu *svc) +{ + BUG_ON(__vcpu_on_runq(svc)); + + list_del_init(&svc->rqd_elem); + update_max_weight(svc->rqd, 0, svc->weight); + + svc->rqd = NULL; +} + +static void +runq_deassign(const struct scheduler *ops, struct vcpu *vc) +{ + struct csched_vcpu *svc = vc->sched_priv; + + BUG_ON(svc->rqd != RQD(ops, vc->processor)); + + __runq_deassign(svc); +} + static void csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc) { struct csched_vcpu *svc = vc->sched_priv; struct domain * const dom = vc->domain; - struct csched_dom *sdom = CSCHED_DOM(dom); + struct csched_dom * const sdom = svc->sdom; printk("%s: Inserting d%dv%d\n", __func__, dom->domain_id, 
vc->vcpu_id); @@ -639,8 +702,7 @@ /* FIXME: Abstract for multiple runqueues */ vcpu_schedule_lock_irq(vc); - list_add_tail(&svc->rqd_elem, &RQD(ops, vc->processor)->svc); - update_max_weight(RQD(ops, vc->processor), svc->weight, 0); + runq_assign(ops, vc); vcpu_schedule_unlock_irq(vc); @@ -672,8 +734,7 @@ /* Remove from runqueue */ vcpu_schedule_lock_irq(vc); - list_del_init(&svc->rqd_elem); - update_max_weight(RQD(ops, vc->processor), 0, svc->weight); + runq_deassign(ops, vc); vcpu_schedule_unlock_irq(vc); @@ -734,6 +795,12 @@ goto out; } + /* Add into the new runqueue if necessary */ + if ( svc->rqd == NULL ) + runq_assign(ops, vc); + else + BUG_ON(RQD(ops, vc->processor) != svc->rqd ); + now = NOW(); /* Put the VCPU on the runq */ @@ -753,6 +820,8 @@ vcpu_schedule_lock_irq(vc); + BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor)); + /* This vcpu is now eligible to be put on the runqueue again */ clear_bit(__CSFLAG_scheduled, &svc->flags); @@ -777,7 +846,7 @@ } static int -csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc) +choose_cpu(const struct scheduler *ops, struct vcpu *vc) { /* FIXME: Chose a schedule group based on load */ /* FIXME: Migrate the vcpu to the new runqueue list, updating @@ -786,6 +855,36 @@ } static int +csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc) +{ + struct csched_vcpu * const svc = CSCHED_VCPU(vc); + int new_cpu; + + /* The scheduler interface doesn''t have an explicit mechanism to + * involve the choosable scheduler in the migrate process, so we + * infer that a change may happen by the call to cpu_pick, and + * remove it from the old runqueue while the lock for the old + * runqueue is held. It can''t be actively waiting to run. It + * will be added to the new runqueue when it next wakes. + * + * If we want to be able to call pick() separately, we need + * to add a mechansim to remove a vcpu from an old processor / + * runqueue before releasing the lock. */ + BUG_ON(__vcpu_on_runq(svc)); + + new_cpu = choose_cpu(ops, vc); + + /* If we''re suggesting moving to a different runqueue, remove it + * from the old runqueue while we have the lock. It will be added + * to the new one when it wakes. */ + if ( svc->rqd != NULL + && RQD(ops, new_cpu) != svc->rqd ) + runq_deassign(ops, vc); + + return new_cpu; +} + +static int csched_dom_cntl( const struct scheduler *ops, struct domain *d, @@ -826,8 +925,10 @@ * lock. */ vcpu_schedule_lock_irq(svc->vcpu); + BUG_ON(svc->rqd != RQD(ops, svc->vcpu->processor)); + svc->weight = sdom->weight; - update_max_weight(RQD(ops, svc->vcpu->processor), svc->weight, old_weight); + update_max_weight(svc->rqd, svc->weight, old_weight); vcpu_schedule_unlock_irq(svc->vcpu); } @@ -1017,7 +1118,9 @@ rqd = RQD(ops, cpu); BUG_ON(!cpu_isset(cpu, rqd->active)); - /* Protected by runqueue lock */ + /* Protected by runqueue lock */ + + BUG_ON(!is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd); /* Clear "tickled" bit now that we''ve been scheduled */ if ( cpu_isset(cpu, rqd->tickled) ) @@ -1067,6 +1170,8 @@ /* If switching, remove this from the runqueue and mark it scheduled */ if ( snext != scurr ) { + BUG_ON(snext->rqd != rqd); + __runq_remove(snext); if ( snext->vcpu->is_running ) { _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 06 of 16] credit2: Calculate instantaneous runqueue load
Add hooks in the various places to detect vcpus becoming active or inactive. At the moment, record only instantaneous runqueue load; but this lays the groundwork for having a load average. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 91cdbd71585b -r 216144f8656f xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:25:02 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:25:16 2010 +0000 @@ -42,6 +42,7 @@ #define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS + 6 #define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS + 7 #define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS + 8 +#define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS + 9 #define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS + 10 /* @@ -187,6 +188,7 @@ cpumask_t idle, /* Currently idle */ tickled; /* Another cpu in the queue is already targeted for this one */ + int load; /* Instantaneous load: Length of queue + num non-idle threads */ }; /* @@ -266,6 +268,23 @@ return list_entry(elem, struct csched_vcpu, runq_elem); } +static void +update_load(const struct scheduler *ops, + struct csched_runqueue_data *rqd, int change, s_time_t now) +{ + rqd->load += change; + + { + struct { + unsigned load:4; + } d; + d.load = rqd->load; + trace_var(TRC_CSCHED2_UPDATE_LOAD, 0, + sizeof(d), + (unsigned char *)&d); + } +} + static int __runq_insert(struct list_head *runq, struct csched_vcpu *svc) { @@ -756,7 +775,11 @@ if ( per_cpu(schedule_data, vc->processor).curr == vc ) cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ); else if ( __vcpu_on_runq(svc) ) + { + BUG_ON(svc->rqd != RQD(ops, vc->processor)); + update_load(ops, svc->rqd, -1, NOW()); __runq_remove(svc); + } else if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) ) clear_bit(__CSFLAG_delayed_runq_add, &svc->flags); } @@ -803,6 +826,8 @@ now = NOW(); + update_load(ops, svc->rqd, 1, now); + /* Put the VCPU on the runq */ runq_insert(ops, vc->processor, svc); runq_tickle(ops, vc->processor, svc, now); @@ -841,6 +866,8 @@ runq_insert(ops, vc->processor, svc); runq_tickle(ops, vc->processor, svc, now); } + else if ( !is_idle_vcpu(vc) ) + update_load(ops, svc->rqd, -1, now); vcpu_schedule_unlock_irq(vc); } @@ -1209,6 +1236,9 @@ /* Update the idle mask if necessary */ if ( !cpu_isset(cpu, rqd->idle) ) cpu_set(cpu, rqd->idle); + /* Make sure avgload gets updated periodically even + * if there''s no activity */ + update_load(ops, rqd, 0, now); } /* @@ -1287,10 +1317,12 @@ { printk("Runqueue %d:\n" "\tncpus = %u\n" - "\tmax_weight = %d\n", + "\tmax_weight = %d\n" + "\tload = %d\n", i, cpus_weight(prv->rqd[i].active), - prv->rqd[i].max_weight); + prv->rqd[i].max_weight, + prv->rqd[i].load); } /* FIXME: Locking! */ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 07 of 16] credit2: Simple cpu picker based on instantaneous load
In preparation for multiple runqueues, add a simple cpu picker that will look for the runqueue with the lowest instantaneous load to assign the vcpu to. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 216144f8656f -r 5eab276bbe4a xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:25:16 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:25:30 2010 +0000 @@ -872,13 +872,75 @@ vcpu_schedule_unlock_irq(vc); } +#define MAX_LOAD (1<<30); static int choose_cpu(const struct scheduler *ops, struct vcpu *vc) { - /* FIXME: Chose a schedule group based on load */ - /* FIXME: Migrate the vcpu to the new runqueue list, updating - max_weight for each runqueue */ - return 0; + struct csched_private *prv = CSCHED_PRIV(ops); + int i, min_load, min_rqi = -1, new_cpu; + struct csched_vcpu *svc = CSCHED_VCPU(vc); + + BUG_ON(cpus_empty(prv->active_queues)); + + /* Locking: + * - vc->processor is already locked + * - Need to grab prv lock to make sure active runqueues don''t + * change + * - Need to grab locks for other runqueues while checking + * avgload + * Locking constraint is: + * - Lock prv before runqueue locks + * - Trylock between runqueue locks (no ordering) + * + * Since one of the runqueue locks is already held, we can''t + * just grab the prv lock. Instead, we''ll have to trylock, and + * do something else reasonable if we fail. + */ + + if ( !spin_trylock(&prv->lock) ) + { + /* Leave it where it is for now. When we actually pay attention + * to affinity we''ll have to figure something out... */ + return vc->processor; + } + + /* FIXME: Pay attention to cpu affinity */ + + min_load = MAX_LOAD; + + /* Find the runqueue with the lowest instantaneous load */ + for_each_cpu_mask(i, prv->active_queues) + { + struct csched_runqueue_data *rqd; + + rqd = prv->rqd + i; + + /* If checking a different runqueue, grab the lock, + * read the avg, and then release the lock. */ + if ( rqd != svc->rqd + && ! spin_trylock(&rqd->lock) ) + continue; + if ( prv->rqd[i].load < min_load ) + { + min_load=prv->rqd[i].load; + min_rqi=i; + } + if ( rqd != svc->rqd ) + spin_unlock(&rqd->lock); + } + + /* We didn''t find anyone (most likely because of spinlock contention); leave it where it is */ + if ( min_rqi == -1 ) + new_cpu = vc->processor; + else + { + BUG_ON(cpus_empty(prv->rqd[min_rqi].active)); + new_cpu = first_cpu(prv->rqd[min_rqi].active); + } + + spin_unlock(&prv->lock); + + return new_cpu; } static int @@ -894,13 +956,12 @@ * runqueue is held. It can''t be actively waiting to run. It * will be added to the new runqueue when it next wakes. * - * If we want to be able to call pick() separately, we need - * to add a mechansim to remove a vcpu from an old processor / - * runqueue before releasing the lock. */ + * If we want to be able to call pick() separately, we need to add + * a mechansim to remove a vcpu from an old processor / runqueue + * before releasing the lock. */ BUG_ON(__vcpu_on_runq(svc)); new_cpu = choose_cpu(ops, vc); - /* If we''re suggesting moving to a different runqueue, remove it * from the old runqueue while we have the lock. It will be added * to the new one when it wakes. */ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
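The picking policy introduced here is: walk the active runqueues, read each one's load while holding its lock, but use trylock and simply skip any runqueue whose lock is contended rather than waiting for it. A standalone sketch of that policy, with pthread mutexes standing in for the runqueue spinlocks; pick_runqueue() and struct runq are illustrative names, not the patch's code:

#include <pthread.h>
#include <stdio.h>

#define NR_RQS 4

/* Illustrative stand-in for csched_runqueue_data: a load guarded by a lock. */
struct runq {
    pthread_mutex_t lock;
    int load;                /* instantaneous load, as in this patch */
};

/* Pick the runqueue with the lowest load, skipping (rather than waiting on)
 * any runqueue whose lock is contended.  'current' is the runqueue whose
 * lock the caller already holds, so it is read without re-locking. */
static int pick_runqueue(struct runq rqs[], int nr, int current)
{
    int min_load = 1 << 30, min_rqi = -1;

    for ( int i = 0; i < nr; i++ )
    {
        if ( i != current && pthread_mutex_trylock(&rqs[i].lock) != 0 )
            continue;                      /* contended: don't spin, just skip */

        if ( rqs[i].load < min_load )
        {
            min_load = rqs[i].load;
            min_rqi = i;
        }

        if ( i != current )
            pthread_mutex_unlock(&rqs[i].lock);
    }
    return min_rqi;                        /* -1 means "leave it where it is" */
}

int main(void)
{
    struct runq rqs[NR_RQS];

    for ( int i = 0; i < NR_RQS; i++ )
    {
        pthread_mutex_init(&rqs[i].lock, NULL);
        rqs[i].load = NR_RQS - i;          /* runqueue 3 is the least loaded */
    }

    pthread_mutex_lock(&rqs[0].lock);      /* caller holds its own runqueue lock */
    printf("least-loaded runqueue: %d\n", pick_runqueue(rqs, NR_RQS, 0));
    pthread_mutex_unlock(&rqs[0].lock);
    return 0;
}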
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 08 of 16] credit2: Detect socket layout and assign one runqueue per socket
Because alloc_pdata() is called before the cpu layout information is available, we grab a callback to the newly-created CPU_STARTING notifier. cpu 0 doesn''t get a callback, so we simply hard-code it to runqueue 0. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 5eab276bbe4a -r c557f4c76911 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:25:30 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:25:44 2010 +0000 @@ -24,6 +24,7 @@ #include <asm/atomic.h> #include <xen/errno.h> #include <xen/trace.h> +#include <xen/cpu.h> #if __i386__ #define PRI_stime "lld" @@ -712,13 +713,15 @@ printk("%s: Inserting d%dv%d\n", __func__, dom->domain_id, vc->vcpu_id); + /* NB: On boot, idle vcpus are inserted before alloc_pdata() has + * been called for that cpu. + */ if ( ! is_idle_vcpu(vc) ) { /* FIXME: Do we need the private lock here? */ list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu); /* Add vcpu to runqueue of initial processor */ - /* FIXME: Abstract for multiple runqueues */ vcpu_schedule_lock_irq(vc); runq_assign(ops, vc); @@ -1462,6 +1465,20 @@ /* Figure out which runqueue to put it in */ rqi = 0; + /* Figure out which runqueue to put it in */ + /* NB: cpu 0 doesn''t get a STARTING callback, so we hard-code it to runqueue 0. */ + if ( cpu == 0 ) + rqi = 0; + else + rqi = cpu_to_socket(cpu); + + if ( rqi < 0 ) + { + printk("%s: cpu_to_socket(%d) returned %d!\n", + __func__, cpu, rqi); + BUG(); + } + rqd=prv->rqd + rqi; printk("Adding cpu %d to runqueue %d\n", cpu, rqi); @@ -1495,7 +1512,13 @@ static void * csched_alloc_pdata(const struct scheduler *ops, int cpu) { - init_pcpu(ops, cpu); + /* Check to see if the cpu is online yet */ + /* Note: cpu 0 doesn''t get a STARTING callback */ + if ( cpu == 0 || cpu_to_socket(cpu) >= 0 ) + init_pcpu(ops, cpu); + else + printk("%s: cpu %d not online yet, deferring initializatgion\n", + __func__, cpu); return (void *)1; } @@ -1543,6 +1566,41 @@ } static int +csched_cpu_starting(int cpu) +{ + struct scheduler *ops; + + /* Hope this is safe from cpupools switching things around. :-) */ + ops = per_cpu(scheduler, cpu); + + init_pcpu(ops, cpu); + + return NOTIFY_DONE; +} + +static int cpu_credit2_callback( + struct notifier_block *nfb, unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + int rc = 0; + + switch ( action ) + { + case CPU_STARTING: + csched_cpu_starting(cpu); + break; + default: + break; + } + + return !rc ? NOTIFY_DONE : notifier_from_errno(rc); +} + +static struct notifier_block cpu_credit2_nfb = { + .notifier_call = cpu_credit2_callback +}; + +static int csched_init(struct scheduler *ops) { int i; @@ -1552,15 +1610,20 @@ " WARNING: This is experimental software in development.\n" \ " Use at your own risk.\n"); + /* Basically no CPU information is available at this point; just + * set up basic structures, and a callback when the CPU info is + * available. */ + prv = xmalloc(struct csched_private); if ( prv == NULL ) return -ENOMEM; memset(prv, 0, sizeof(*prv)); ops->sched_data = prv; - spin_lock_init(&prv->lock); INIT_LIST_HEAD(&prv->sdom); + register_cpu_notifier(&cpu_credit2_nfb); + /* But un-initialize all runqueues */ for ( i=0; i<NR_CPUS; i++) { _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
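The mapping policy is: runqueue index = socket id, with cpu 0 special-cased to runqueue 0 because it never receives a CPU_STARTING callback, and initialization deferred for any cpu whose topology is not yet known. A toy sketch of just that decision; the fake_socket table and runqueue_for() are made up for illustration and are not the Xen cpu_to_socket() implementation:

#include <stdio.h>

/* Hypothetical stand-in for cpu_to_socket(): returns -1 until the cpu's
 * topology information is available (i.e. before its CPU_STARTING callback). */
static int fake_socket[] = { 0, 0, -1, 1 };   /* cpu 2 not online yet */
static int cpu_to_socket(int cpu) { return fake_socket[cpu]; }

/* cpu 0 gets no STARTING callback, so it is hard-coded to runqueue 0;
 * every other cpu maps to its socket id. */
static int runqueue_for(int cpu)
{
    if ( cpu == 0 )
        return 0;
    return cpu_to_socket(cpu);   /* may be -1: defer init until CPU_STARTING */
}

int main(void)
{
    for ( int cpu = 0; cpu < 4; cpu++ )
    {
        int rqi = runqueue_for(cpu);
        if ( rqi < 0 )
            printf("cpu %d not online yet, deferring initialization\n", cpu);
        else
            printf("cpu %d -> runqueue %d\n", cpu, rqi);
    }
    return 0;
}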
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 09 of 16] credit2: Calculate load average
Calculate a per-runqueue decaying load average. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r c557f4c76911 -r ab53467fcc74 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:25:44 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:25:58 2010 +0000 @@ -175,6 +175,18 @@ #define RQD(_ops, _cpu) (&CSCHED_PRIV(_ops)->rqd[c2r(_ops, _cpu)]) /* + * Shifts for load average. + * - granularity: Reduce granularity of time by a factor of 1000, so we can use 32-bit maths + * - window shift: Given granularity shift, make the window about 1 second + * - scale shift: Shift up load by this amount rather than using fractions; 128 corresponds + * to a load of 1. + */ +#define LOADAVG_GRANULARITY_SHIFT (10) +int opt_load_window_shift=18; +#define LOADAVG_WINDOW_SHIFT_MIN 4 +integer_param("credit2_load_window_shift", opt_load_window_shift); + +/* * Per-runqueue data */ struct csched_runqueue_data { @@ -190,6 +202,8 @@ cpumask_t idle, /* Currently idle */ tickled; /* Another cpu in the queue is already targeted for this one */ int load; /* Instantaneous load: Length of queue + num non-idle threads */ + s_time_t load_last_update; /* Last time average was updated */ + s_time_t avgload; /* Decaying queue load */ }; /* @@ -204,6 +218,8 @@ int runq_map[NR_CPUS]; cpumask_t active_queues; /* Queues which may have active cpus */ struct csched_runqueue_data rqd[NR_CPUS]; + + int load_window_shift; }; /* @@ -273,13 +289,34 @@ update_load(const struct scheduler *ops, struct csched_runqueue_data *rqd, int change, s_time_t now) { + struct csched_private *prv = CSCHED_PRIV(ops); + s_time_t delta=-1; + + now >>= LOADAVG_GRANULARITY_SHIFT; + + if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now ) + { + rqd->avgload = rqd->load << (1ULL<prv->load_window_shift); + } + else + { + delta = now - rqd->load_last_update; + + rqd->avgload + ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) ) + + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift; + } + rqd->load += change; - + rqd->load_last_update = now; { struct { - unsigned load:4; + unsigned load:4, avgload:28; + int delta; } d; d.load = rqd->load; + d.avgload = rqd->avgload; + d.delta = delta; trace_var(TRC_CSCHED2_UPDATE_LOAD, 0, sizeof(d), (unsigned char *)&d); @@ -1610,6 +1647,15 @@ " WARNING: This is experimental software in development.\n" \ " Use at your own risk.\n"); + printk(" load_window_shift: %d\n", opt_load_window_shift); + + if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN ) + { + printk("%s: opt_load_window_shift %d below min %d, resetting\n", + __func__, opt_load_window_shift, LOADAVG_WINDOW_SHIFT_MIN); + opt_load_window_shift = LOADAVG_WINDOW_SHIFT_MIN; + } + /* Basically no CPU information is available at this point; just * set up basic structures, and a callback when the CPU info is * available. */ @@ -1631,6 +1677,8 @@ prv->rqd[i].id = -1; } + prv->load_window_shift = opt_load_window_shift; + return 0; } _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
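The load average is a fixed-point, linearly decaying average over a window of 2^load_window_shift time units, with (1 << shift) representing a load of 1.0: if a whole window has elapsed since the last update, the average snaps to the current load; otherwise the current load is blended in proportionally to the elapsed time. (The same update is applied per-vcpu in the next patch, with the vcpu's load being 0 or 1.) A self-contained arithmetic sketch of that update, assuming the corrected form of the expression that appears in the following patch; W, update_runq_load() and the worked numbers are illustrative only:

#include <stdint.h>
#include <stdio.h>

#define W 18   /* credit2_load_window_shift default; 1 << W == load of 1.0 */

static int64_t avgload;       /* decaying average, scaled by 1 << W         */
static int64_t last_update;   /* time of last update, in granularity units  */

static void update_runq_load(int load, int64_t now)
{
    int64_t delta = now - last_update;

    if ( delta >= (1LL << W) )
        /* A whole window has passed: the average is just the current load. */
        avgload = (int64_t)load << W;
    else
        /* Linear blend: weight the current load by delta, the old average
         * by the remainder of the window, then scale back down. */
        avgload = ( delta * ((int64_t)load << W)
                    + ((1LL << W) - delta) * avgload ) >> W;

    last_update = now;
}

int main(void)
{
    /* History: idle until t=0, then 2 runnable vcpus for half a window.
     * At the half-window mark we fold that interval into the average. */
    last_update = 0;
    avgload = 0;
    update_runq_load(2, 1LL << (W - 1));

    printf("avgload = %.3f (1.0 == one vcpu busy for a full window)\n",
           (double)avgload / (1LL << W));   /* prints 1.000 */
    return 0;
}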
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 10 of 16] credit2: Track average load contributed by a vcpu
Track the amount of load contributed by a particular vcpu, to help us make informed decisions about what will happen if we make a move. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r ab53467fcc74 -r 0b950f9d3332 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:25:58 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:26:13 2010 +0000 @@ -45,6 +45,8 @@ #define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS + 8 #define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS + 9 #define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS + 10 +#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS + 11 +#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS + 12 /* * WARNING: This is still in an experimental phase. Status and work can be found at the @@ -241,6 +243,9 @@ s_time_t start_time; /* When we were scheduled (used for credit) */ unsigned flags; /* 16 bits doesn''t seem to play well with clear_bit() */ + /* Individual contribution to load */ + s_time_t load_last_update; /* Last time average was updated */ + s_time_t avgload; /* Decaying queue load */ }; /* @@ -286,8 +291,8 @@ } static void -update_load(const struct scheduler *ops, - struct csched_runqueue_data *rqd, int change, s_time_t now) +__update_runq_load(const struct scheduler *ops, + struct csched_runqueue_data *rqd, int change, s_time_t now) { struct csched_private *prv = CSCHED_PRIV(ops); s_time_t delta=-1; @@ -296,7 +301,7 @@ if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now ) { - rqd->avgload = rqd->load << (1ULL<prv->load_window_shift); + rqd->avgload = (unsigned long long)rqd->load << prv->load_window_shift; } else { @@ -306,23 +311,78 @@ ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) ) + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift; } - rqd->load += change; rqd->load_last_update = now; + { struct { - unsigned load:4, avgload:28; - int delta; + unsigned rq_load:4, rq_avgload:28; + unsigned rq_id:4; } d; - d.load = rqd->load; - d.avgload = rqd->avgload; - d.delta = delta; - trace_var(TRC_CSCHED2_UPDATE_LOAD, 0, + d.rq_id=rqd->id; + d.rq_load = rqd->load; + d.rq_avgload = rqd->avgload; + trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1, sizeof(d), (unsigned char *)&d); } } +static void +__update_svc_load(const struct scheduler *ops, + struct csched_vcpu *svc, int change, s_time_t now) +{ + struct csched_private *prv = CSCHED_PRIV(ops); + s_time_t delta=-1; + int vcpu_load; + + if ( change == -1 ) + vcpu_load = 1; + else if ( change == 1 ) + vcpu_load = 0; + else + vcpu_load = vcpu_runnable(svc->vcpu); + + now >>= LOADAVG_GRANULARITY_SHIFT; + + if ( svc->load_last_update + (1ULL<<prv->load_window_shift) < now ) + { + svc->avgload = (unsigned long long)vcpu_load << prv->load_window_shift; + } + else + { + delta = now - svc->load_last_update; + + svc->avgload + ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) ) + + ( ((1ULL<<prv->load_window_shift) - delta) * svc->avgload ) ) >> prv->load_window_shift; + } + svc->load_last_update = now; + + { + struct { + unsigned dom:16,vcpu:16; + unsigned v_avgload:32; + } d; + d.dom = svc->vcpu->domain->domain_id; + d.vcpu = svc->vcpu->vcpu_id; + d.v_avgload = svc->avgload; + trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1, + sizeof(d), + (unsigned char *)&d); + } +} + +static void +update_load(const struct scheduler *ops, + struct csched_runqueue_data *rqd, + struct csched_vcpu *svc, int change, s_time_t now) +{ + __update_runq_load(ops, rqd, change, now); + if ( svc ) + 
__update_svc_load(ops, svc, change, now); +} + static int __runq_insert(struct list_head *runq, struct csched_vcpu *svc) { @@ -672,6 +732,9 @@ svc->credit = CSCHED_CREDIT_INIT; svc->weight = svc->sdom->weight; + /* Starting load of 50% */ + svc->avgload = 1ULL << (CSCHED_PRIV(ops)->load_window_shift - 1); + svc->load_last_update = NOW(); } else { @@ -817,7 +880,7 @@ else if ( __vcpu_on_runq(svc) ) { BUG_ON(svc->rqd != RQD(ops, vc->processor)); - update_load(ops, svc->rqd, -1, NOW()); + update_load(ops, svc->rqd, svc, -1, NOW()); __runq_remove(svc); } else if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) ) @@ -866,7 +929,7 @@ now = NOW(); - update_load(ops, svc->rqd, 1, now); + update_load(ops, svc->rqd, svc, 1, now); /* Put the VCPU on the runq */ runq_insert(ops, vc->processor, svc); @@ -907,7 +970,7 @@ runq_tickle(ops, vc->processor, svc, now); } else if ( !is_idle_vcpu(vc) ) - update_load(ops, svc->rqd, -1, now); + update_load(ops, svc->rqd, svc, -1, now); vcpu_schedule_unlock_irq(vc); } @@ -1339,7 +1402,7 @@ cpu_set(cpu, rqd->idle); /* Make sure avgload gets updated periodically even * if there''s no activity */ - update_load(ops, rqd, 0, now); + update_load(ops, rqd, NULL, 0, now); } /* _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 11 of 16] credit2: Track expected load
As vcpus are migrated, track how we expect the load to change. This helps smooth migrations when the balancing doesn''t take immediate effect on the load average. In theory, if vcpu activity remains constant, then the measured avgload should converge to the balanced avgload. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 0b950f9d3332 -r 7bb94bc97cc0 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:26:13 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:26:28 2010 +0000 @@ -206,6 +206,7 @@ int load; /* Instantaneous load: Length of queue + num non-idle threads */ s_time_t load_last_update; /* Last time average was updated */ s_time_t avgload; /* Decaying queue load */ + s_time_t b_avgload; /* Decaying queue load modified by balancing */ }; /* @@ -302,6 +303,7 @@ if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now ) { rqd->avgload = (unsigned long long)rqd->load << prv->load_window_shift; + rqd->b_avgload = (unsigned long long)rqd->load << prv->load_window_shift; } else { @@ -310,6 +312,10 @@ rqd->avgload ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) ) + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift; + + rqd->b_avgload + ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) ) + + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->b_avgload ) ) >> prv->load_window_shift; } rqd->load += change; rqd->load_last_update = now; @@ -317,11 +323,12 @@ { struct { unsigned rq_load:4, rq_avgload:28; - unsigned rq_id:4; + unsigned rq_id:4, b_avgload:28; } d; d.rq_id=rqd->id; d.rq_load = rqd->load; d.rq_avgload = rqd->avgload; + d.b_avgload = rqd->b_avgload; trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1, sizeof(d), (unsigned char *)&d); @@ -756,6 +763,9 @@ update_max_weight(svc->rqd, svc->weight, 0); + /* Expected new load based on adding this vcpu */ + rqd->b_avgload += svc->avgload; + /* TRACE */ { struct { @@ -790,6 +800,9 @@ list_del_init(&svc->rqd_elem); update_max_weight(svc->rqd, 0, svc->weight); + /* Expected new load based on removing this vcpu */ + svc->rqd->b_avgload -= svc->avgload; + svc->rqd = NULL; } _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
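b_avgload is the measured average adjusted by the load the balancer expects to gain or lose from vcpus that have been assigned to or removed from the runqueue but whose effect has not yet shown up in the measured average. A minimal sketch of that bookkeeping; the struct names and the 0.5-load vcpu in the example are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define W 18   /* load_window_shift; 1 << W == load of 1.0 */

/* Minimal stand-ins for the runqueue and vcpu load fields used above. */
struct runq { int64_t b_avgload; };
struct vcpu { int64_t avgload; };

static void runq_assign(struct runq *rqd, struct vcpu *svc)
{
    rqd->b_avgload += svc->avgload;   /* expected load after adding this vcpu */
}

static void runq_deassign(struct runq *rqd, struct vcpu *svc)
{
    rqd->b_avgload -= svc->avgload;   /* expected load after removing it */
}

int main(void)
{
    struct runq from = { .b_avgload = 3LL << W };        /* measured load 3.0 */
    struct runq to   = { .b_avgload = 1LL << W };        /* measured load 1.0 */
    struct vcpu v    = { .avgload   = 1LL << (W - 1) };  /* contributes 0.5   */

    /* Migrating v: the balancer immediately accounts for the expected shift,
     * rather than waiting for the measured averages to converge. */
    runq_deassign(&from, &v);
    runq_assign(&to, &v);

    printf("expected loads after migration: %.2f -> %.2f\n",
           (double)from.b_avgload / (1LL << W),
           (double)to.b_avgload / (1LL << W));
    return 0;
}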
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 12 of 16] credit2: Migrate request infrastructure
Put in infrastructure to allow a vcpu to requeset to migrate to a specific runqueue. This will allow a load balancer to choose running VMs to migrate, and know they will go where expected when the VM is descheduled. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 7bb94bc97cc0 -r 97c754b8c144 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:26:28 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:26:43 2010 +0000 @@ -157,6 +157,12 @@ */ #define __CSFLAG_delayed_runq_add 2 #define CSFLAG_delayed_runq_add (1<<__CSFLAG_delayed_runq_add) +/* CSFLAG_runq_migrate_request: This vcpu is being migrated as a result of a + * credit2-initiated runq migrate request; migrate it to the runqueue indicated + * in the svc struct. + */ +#define __CSFLAG_runq_migrate_request 3 +#define CSFLAG_runq_migrate_request (1<<__CSFLAG_runq_migrate_request) int opt_migrate_resist=500; @@ -247,6 +253,8 @@ /* Individual contribution to load */ s_time_t load_last_update; /* Last time average was updated */ s_time_t avgload; /* Decaying queue load */ + + struct csched_runqueue_data *migrate_rqd; /* Pre-determined rqd to which to migrate */ }; /* @@ -974,10 +982,10 @@ * it seems a bit pointless; especially as we have plenty of * bits free. */ - if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) ) + if ( test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags) + && likely(vcpu_runnable(vc)) ) { BUG_ON(__vcpu_on_runq(svc)); - clear_bit(__CSFLAG_delayed_runq_add, &svc->flags); runq_insert(ops, vc->processor, svc); runq_tickle(ops, vc->processor, svc, now); @@ -1015,11 +1023,34 @@ if ( !spin_trylock(&prv->lock) ) { + if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) ) + { + d2printk("d%dv%d -\n", svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id); + clear_bit(__CSFLAG_runq_migrate_request, &svc->flags); + } /* Leave it where it is for now. When we actually pay attention * to affinity we''ll have to figure something out... */ return vc->processor; } + /* First check to see if we''re here because someone else suggested a place + * for us to move. */ + if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) ) + { + if ( unlikely(svc->migrate_rqd->id < 0) ) + { + printk("%s: Runqueue migrate aborted because target runqueue disappeared!\n", + __func__); + /* Fall-through to normal cpu pick */ + } + else + { + d2printk("d%dv%d +\n", svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id); + new_cpu = first_cpu(svc->migrate_rqd->active); + goto out_up; + } + } + /* FIXME: Pay attention to cpu affinity */ min_load = MAX_LOAD; @@ -1053,7 +1084,8 @@ BUG_ON(cpus_empty(prv->rqd[min_rqi].active)); new_cpu = first_cpu(prv->rqd[min_rqi].active); } - + +out_up: spin_unlock(&prv->lock); return new_cpu; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
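The handshake is: the balancer records a target runqueue in the vcpu's migrate_rqd field, sets a flag, and forces a migration; when the vcpu is actually descheduled and cpu_pick() runs, the flag is consumed and the pre-chosen runqueue is used, unless that runqueue has since been deactivated. A simplified standalone model of that handshake, with plain booleans instead of atomic flag bits; request_migrate() and cpu_pick() here are illustrative names:

#include <stdbool.h>
#include <stdio.h>

struct runq { int id; int first_cpu; };

struct vcpu {
    bool migrate_requested;        /* stands in for __CSFLAG_runq_migrate_request */
    struct runq *migrate_rqd;      /* pre-determined destination runqueue */
    int processor;
};

static void request_migrate(struct vcpu *v, struct runq *target)
{
    v->migrate_rqd = target;
    v->migrate_requested = true;   /* balancer: "when you deschedule, go here" */
}

static int cpu_pick(struct vcpu *v)
{
    if ( v->migrate_requested )
    {
        v->migrate_requested = false;
        if ( v->migrate_rqd->id >= 0 )          /* target still active? */
            return v->migrate_rqd->first_cpu;
        /* otherwise fall through to the normal placement logic */
    }
    return v->processor;           /* placeholder for the load-based choice */
}

int main(void)
{
    struct runq rq1 = { .id = 1, .first_cpu = 4 };
    struct vcpu v = { .processor = 0 };

    request_migrate(&v, &rq1);
    printf("vcpu descheduled, cpu_pick() -> cpu %d\n", cpu_pick(&v));
    return 0;
}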
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 13 of 16] credit2: Use loadavg to pick cpus, instead of instantaneous load
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 97c754b8c144 -r 648a4a86b8c8 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:26:43 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:26:58 2010 +0000 @@ -996,13 +996,14 @@ vcpu_schedule_unlock_irq(vc); } -#define MAX_LOAD (1<<30); +#define MAX_LOAD (1ULL<<60); static int choose_cpu(const struct scheduler *ops, struct vcpu *vc) { struct csched_private *prv = CSCHED_PRIV(ops); - int i, min_load, min_rqi = -1, new_cpu; + int i, min_rqi = -1, new_cpu; struct csched_vcpu *svc = CSCHED_VCPU(vc); + s_time_t min_avgload; BUG_ON(cpus_empty(prv->active_queues)); @@ -1053,27 +1054,39 @@ /* FIXME: Pay attention to cpu affinity */ - min_load = MAX_LOAD; + min_avgload = MAX_LOAD; /* Find the runqueue with the lowest instantaneous load */ for_each_cpu_mask(i, prv->active_queues) { struct csched_runqueue_data *rqd; + s_time_t rqd_avgload; rqd = prv->rqd + i; /* If checking a different runqueue, grab the lock, - * read the avg, and then release the lock. */ - if ( rqd != svc->rqd - && ! spin_trylock(&rqd->lock) ) + * read the avg, and then release the lock. + * + * If on our own runqueue, don''t grab or release the lock; + * but subtract our own load from the runqueue load to simulate + * impartiality */ + if ( rqd == svc->rqd ) + { + rqd_avgload = rqd->b_avgload - svc->avgload; + } + else if ( spin_trylock(&rqd->lock) ) + { + rqd_avgload = rqd->b_avgload; + spin_unlock(&rqd->lock); + } + else continue; - if ( prv->rqd[i].load < min_load ) + + if ( rqd_avgload < min_avgload ) { - min_load=prv->rqd[i].load; + min_avgload = rqd_avgload; min_rqi=i; } - if ( rqd != svc->rqd ) - spin_unlock(&rqd->lock); } /* We didn''t find anyone (most likely because of spinlock contention); leave it where it is */ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
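When evaluating its own runqueue, the picker subtracts the vcpu's own contribution, so the comparison is effectively "everyone else here" versus "everyone there"; otherwise a vcpu would count its own load against staying put. A small worked example of why that matters (fixed point, 1 << W == load 1.0; the numbers are made up):

#include <stdint.h>
#include <stdio.h>

#define W 18   /* load_window_shift; 1 << W == load of 1.0 */

int main(void)
{
    int64_t our_rq_b_avgload   = 2LL << W;             /* 2.0, includes us */
    int64_t our_vcpu_avgload   = 1LL << (W - 1);       /* we contribute 0.5 */
    int64_t other_rq_b_avgload = (17LL << W) / 10;     /* about 1.7 */

    /* Compare the load we'd leave behind against the load we'd join. */
    int64_t here  = our_rq_b_avgload - our_vcpu_avgload;   /* 1.5 */
    int64_t there = other_rq_b_avgload;                    /* 1.7 */

    printf("stay (%.2f) vs move (%.2f): %s\n",
           (double)here / (1LL << W), (double)there / (1LL << W),
           here <= there ? "stay put" : "migrate");
    return 0;
}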
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 14 of 16] credit2: Introduce a loadavg-based load balancer
This is a first-cut at getting load balancing. I''m first working on looking at behavior I want to get correct; then, once I know what kind of behavior works well, then I''ll work on getting it efficient. The general idea is when balancing runqueues, look for the runqueue whose loadavg is the most different from ours (higher or lower). Then, look for a transaction which will bring the loads closest together: either pushing a vcpu, pulling a vcpu, or swapping them. Use the per-vcpu load to calculate the expected load after the exchange. The current algorithm looks at every combination, which is O(N^2). That''s not going to be suitable for workloads with large numbers of vcpus (such as highly consolidated VDI deployments). I''ll make a more efficient algorithm once I''ve experimented and determined what I think is the best load-balancing behavior. At the moment, balance from a runqueue every time the credit resets. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r 648a4a86b8c8 -r dca9ad897502 xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:26:58 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:27:14 2010 +0000 @@ -1104,6 +1104,216 @@ return new_cpu; } +static void balance_load(const struct scheduler *ops, int cpu, s_time_t now) +{ + struct csched_private *prv = CSCHED_PRIV(ops); + int i, max_delta_rqi = -1; + struct list_head *push_iter, *pull_iter; + + /* NB: Modified by consider() */ + s_time_t load_delta; + struct csched_vcpu * best_push_svc=NULL, *best_pull_svc=NULL; + /* NB: Read by consider() */ + struct csched_runqueue_data *lrqd; + struct csched_runqueue_data *orqd; + + void consider(struct csched_vcpu *push_svc, + struct csched_vcpu *pull_svc) + { + s_time_t l_load, o_load, delta; + + l_load = lrqd->b_avgload; + o_load = orqd->b_avgload; + if ( push_svc ) + { + /* What happens to the load on both if we push? */ + l_load -= push_svc->avgload; + o_load += push_svc->avgload; + } + if ( pull_svc ) + { + /* What happens to the load on both if we pull? */ + l_load += pull_svc->avgload; + o_load -= pull_svc->avgload; + } + + delta = l_load - o_load; + if ( delta < 0 ) + delta = -delta; + + if ( delta < load_delta ) + { + load_delta = delta; + best_push_svc=push_svc; + best_pull_svc=pull_svc; + } + } + + void migrate(struct csched_vcpu *svc, struct csched_runqueue_data *trqd) + { + if ( test_bit(__CSFLAG_scheduled, &svc->flags) ) + { + d2printk("d%dv%d %d-%d a\n", svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id, + svc->rqd->id, trqd->id); + /* It''s running; mark it to migrate. */ + svc->migrate_rqd = trqd; + set_bit(_VPF_migrating, &svc->vcpu->pause_flags); + set_bit(__CSFLAG_runq_migrate_request, &svc->flags); + } + else + { + int on_runq=0; + /* It''s not running; just move it */ + d2printk("d%dv%d %d-%d i\n", svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id, + svc->rqd->id, trqd->id); + if ( __vcpu_on_runq(svc) ) + { + __runq_remove(svc); + update_load(ops, svc->rqd, svc, -1, now); + on_runq=1; + } + __runq_deassign(svc); + svc->vcpu->processor = first_cpu(trqd->active); + __runq_assign(svc, trqd); + if ( on_runq ) + { + update_load(ops, svc->rqd, svc, 1, now); + runq_insert(ops, svc->vcpu->processor, svc); + runq_tickle(ops, svc->vcpu->processor, svc, now); + } + } + } + + + /* + * Basic algorithm: Push, pull, or swap. + * - Find the runqueue with the furthest load distance + * - Find a pair that makes the difference the least (where one + * on either side may be empty). 
+ */ + + /* Locking: + * - pcpu schedule lock should be already locked + */ + lrqd = RQD(ops, cpu); + + __update_runq_load(ops, lrqd, 0, now); + +retry: + if ( !spin_trylock(&prv->lock) ) + return; + + load_delta = 0; + + for_each_cpu_mask(i, prv->active_queues) + { + s_time_t delta; + + orqd = prv->rqd + i; + + if ( orqd == lrqd + || !spin_trylock(&orqd->lock) ) + continue; + + __update_runq_load(ops, orqd, 0, now); + + delta = lrqd->b_avgload - orqd->b_avgload; + if ( delta < 0 ) + delta = -delta; + + if ( delta > load_delta ) + { + load_delta = delta; + max_delta_rqi = i; + } + + spin_unlock(&orqd->lock); + } + + /* Minimize holding the big lock */ + spin_unlock(&prv->lock); + + if ( max_delta_rqi == -1 ) + goto out; + + /* Don''t bother with load differences less than 25%. */ + if ( load_delta < (1ULL<<(prv->load_window_shift - 2)) ) + goto out; + + /* Try to grab the other runqueue lock; if it''s been taken in the + * meantime, try the process over again. This can''t deadlock + * because if it doesn''t get any other rqd locks, it will simply + * give up and return. */ + orqd = prv->rqd + max_delta_rqi; + if ( !spin_trylock(&orqd->lock) ) + goto retry; + + /* Make sure the runqueue hasn''t been deactivated since we released prv->lock */ + if ( unlikely(orqd->id < 0) ) + goto out_up; + + /* Look for "swap" which gives the best load average + * FIXME: O(n^2)! */ + + /* Reuse load delta (as we''re trying to minimize it) */ + list_for_each( push_iter, &lrqd->svc ) + { + int inner_load_updated = 0; + struct csched_vcpu * push_svc = list_entry(push_iter, struct csched_vcpu, rqd_elem); + + __update_svc_load(ops, push_svc, 0, now); + + /* Skip this one if it''s already been flagged to migrate */ + if ( test_bit(__CSFLAG_runq_migrate_request, &push_svc->flags) ) + continue; + + list_for_each( pull_iter, &orqd->svc ) + { + struct csched_vcpu * pull_svc = list_entry(pull_iter, struct csched_vcpu, rqd_elem); + + if ( ! inner_load_updated ) + { + __update_svc_load(ops, pull_svc, 0, now); + } + + /* Skip this one if it''s already been flagged to migrate */ + if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) ) + continue; + + consider(push_svc, pull_svc); + } + + inner_load_updated = 1; + + /* Consider push only */ + consider(push_svc, NULL); + } + + list_for_each( pull_iter, &orqd->svc ) + { + struct csched_vcpu * pull_svc = list_entry(pull_iter, struct csched_vcpu, rqd_elem); + + /* Skip this one if it''s already been flagged to migrate */ + if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) ) + continue; + + /* Consider pull only */ + consider(NULL, pull_svc); + } + + /* OK, now we have some candidates; do the moving */ + if ( best_push_svc ) + migrate(best_push_svc, orqd); + if ( best_pull_svc ) + migrate(best_pull_svc, lrqd); + +out_up: + spin_unlock(&orqd->lock); + +out: + return; +} + static int csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc) { @@ -1437,7 +1647,10 @@ /* Check for the reset condition */ if ( snext->credit <= CSCHED_CREDIT_RESET ) + { reset_credit(ops, cpu, now); + balance_load(ops, cpu, now); + } /* Clear the idle mask if necessary */ if ( cpu_isset(cpu, rqd->idle) ) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
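The core of the balancer is consider(): for every push-only, pull-only and push+pull (swap) candidate, compute what the two b_avgloads would become and keep the candidate that minimizes their difference, starting from the current difference so a transaction is only chosen if it actually improves things. A self-contained sketch of that search, with the same O(n*m) shape as the patch; pick_transaction() and the example loads are invented for illustration:

#include <stdint.h>
#include <stdio.h>

#define W 10   /* smaller shift than the default, just to keep numbers readable */

/* Loads are fixed point with 1 << W == 1.0.  An index of -1 means "no push"
 * or "no pull" for that side of the transaction. */
static void pick_transaction(int64_t l_load, const int64_t *l_vcpu, int nl,
                             int64_t o_load, const int64_t *o_vcpu, int no,
                             int *best_push, int *best_pull)
{
    int64_t best_delta = l_load - o_load;
    if ( best_delta < 0 )
        best_delta = -best_delta;          /* start from the current difference */
    *best_push = *best_pull = -1;

    for ( int p = -1; p < nl; p++ )
        for ( int q = -1; q < no; q++ )
        {
            int64_t l = l_load, o = o_load, delta;

            if ( p >= 0 ) { l -= l_vcpu[p]; o += l_vcpu[p]; }   /* push */
            if ( q >= 0 ) { l += o_vcpu[q]; o -= o_vcpu[q]; }   /* pull */

            delta = l - o;
            if ( delta < 0 )
                delta = -delta;

            if ( delta < best_delta )
            {
                best_delta = delta;
                *best_push = p;
                *best_pull = q;
            }
        }
}

int main(void)
{
    /* Local queue carries 2.5 across three vcpus; the other queue only 0.5. */
    int64_t l_vcpu[] = { 1 << W, 1 << W, 1 << (W - 1) };
    int64_t o_vcpu[] = { 1 << (W - 1) };
    int push, pull;

    pick_transaction(5LL << (W - 1), l_vcpu, 3,
                     1LL << (W - 1), o_vcpu, 1, &push, &pull);
    printf("push local vcpu %d, pull remote vcpu %d\n", push, pull);
    return 0;
}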
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 15 of 16] credit2: Different unbalance tolerance for underloaded and overloaded queues
Allow the "unbalance tolerance" -- the amount of difference between two runqueues that will be allowed before rebalancing -- to differ depending on how busy the runqueue is. If it''s less than 100%, default to a difference of 1.0; if it''s more than 100%, default to a tolerance of 0.125. Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> diff -r dca9ad897502 -r 975588ecb94e xen/common/sched_credit2.c --- a/xen/common/sched_credit2.c Thu Dec 23 12:27:14 2010 +0000 +++ b/xen/common/sched_credit2.c Thu Dec 23 12:27:27 2010 +0000 @@ -193,6 +193,10 @@ int opt_load_window_shift=18; #define LOADAVG_WINDOW_SHIFT_MIN 4 integer_param("credit2_load_window_shift", opt_load_window_shift); +int opt_underload_balance_tolerance=0; +integer_param("credit2_balance_under", opt_underload_balance_tolerance); +int opt_overload_balance_tolerance=-3; +integer_param("credit2_balance_over", opt_overload_balance_tolerance); /* * Per-runqueue data @@ -1232,14 +1236,34 @@ /* Minimize holding the big lock */ spin_unlock(&prv->lock); - if ( max_delta_rqi == -1 ) goto out; - /* Don''t bother with load differences less than 25%. */ - if ( load_delta < (1ULL<<(prv->load_window_shift - 2)) ) - goto out; + { + s_time_t load_max; + int cpus_max; + + load_max = lrqd->b_avgload; + if ( orqd->b_avgload > load_max ) + load_max = orqd->b_avgload; + + cpus_max=cpus_weight(lrqd->active); + if ( cpus_weight(orqd->active) > cpus_max ) + cpus_max = cpus_weight(orqd->active); + + /* If we''re under 100% capacaty, only shift if load difference + * is > 1. otherwise, shift if under 12.5% */ + if ( load_max < (1ULL<<(prv->load_window_shift))*cpus_max ) + { + if ( load_delta < (1ULL<<(prv->load_window_shift+opt_underload_balance_tolerance) ) ) + goto out; + } + else + if ( load_delta < (1ULL<<(prv->load_window_shift+opt_overload_balance_tolerance)) ) + goto out; + } + /* Try to grab the other runqueue lock; if it''s been taken in the * meantime, try the process over again. This can''t deadlock * because if it doesn''t get any other rqd locks, it will simply @@ -1982,6 +2006,8 @@ " Use at your own risk.\n"); printk(" load_window_shift: %d\n", opt_load_window_shift); + printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance); + printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance); if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN ) { _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
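The thresholds work in the same fixed-point scale as the load average: with the default credit2_balance_under=0 the underloaded threshold is 1 << (shift+0), i.e. a load difference of 1.0, and with credit2_balance_over=-3 the overloaded threshold is 1 << (shift-3), i.e. 0.125. A small worked example of that decision; worth_balancing() and the sample loads are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* Defaults from this patch: credit2_balance_under=0, credit2_balance_over=-3,
 * credit2_load_window_shift=18.  1 << SHIFT represents a load of 1.0. */
#define SHIFT 18
#define UNDER 0
#define OVER  (-3)

static int worth_balancing(int64_t load_max, int cpus_max, int64_t load_delta)
{
    if ( load_max < (1LL << SHIFT) * cpus_max )
        /* Queues not fully busy: only move for a difference of >= 1.0 vcpu. */
        return load_delta >= (1LL << (SHIFT + UNDER));
    /* Overloaded: even a 0.125 difference is worth smoothing out. */
    return load_delta >= (1LL << (SHIFT + OVER));
}

int main(void)
{
    int64_t half = 1LL << (SHIFT - 1);   /* a load difference of 0.5 */

    /* 4-cpu queues, busiest carrying load 2.0: underloaded, 0.5 isn't enough. */
    printf("underloaded, delta 0.5: %s\n",
           worth_balancing(2LL << SHIFT, 4, half) ? "balance" : "leave");

    /* Same delta, but the busiest queue is at 5.0 on 4 cpus: overloaded. */
    printf("overloaded,  delta 0.5: %s\n",
           worth_balancing(5LL << SHIFT, 4, half) ? "balance" : "leave");
    return 0;
}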
George Dunlap
2010-Dec-23 12:38 UTC
[Xen-devel] [PATCH 16 of 16] credit2: On debug keypress print load average as a fraction
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 975588ecb94e -r feda9b7cd63a xen/common/sched_credit2.c
--- a/xen/common/sched_credit2.c	Thu Dec 23 12:27:27 2010 +0000
+++ b/xen/common/sched_credit2.c	Thu Dec 23 12:27:41 2010 +0000
@@ -1774,14 +1774,20 @@
              CSCHED_DEFAULT_WEIGHT);
     for_each_cpu_mask(i, prv->active_queues)
     {
+        s_time_t fraction;
+
+        fraction = prv->rqd[i].avgload * 100 / (1ULL<<prv->load_window_shift);
+
         printk("Runqueue %d:\n"
                "\tncpus = %u\n"
                "\tmax_weight = %d\n"
-               "\tload = %d\n",
+               "\tinstload = %d\n"
+               "\taveload = %3ld\n",
                i,
                cpus_weight(prv->rqd[i].active),
                prv->rqd[i].max_weight,
-               prv->rqd[i].load);
+               prv->rqd[i].load,
+               fraction);
     }
 
     /* FIXME: Locking! */

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel