Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 0 of 6 v2] xen: sched_credit: fix picking & tickling and also add some tracing
Hello everyone,

This is v2 of my previously submitted series about fixing scheduling anomalies and introducing some tracing in the credit scheduler (with a couple of other side effects). All the comments v1 got have been addressed, and the series grew a couple more patches as I found some other issues, still falling under the broad description given above. Details are given in the individual changelogs but, to make review as easy as possible, here comes a short overview.

[1 of 6] xen: sched_credit: improve picking up the idlal CPU for a VCPU
[2 of 6] xen: sched_credit: improve tickling of idle CPUs

Are the fixes to the scheduling anomalies, happening during PCPU picking and tickling, respectively. The latter has already been extensively discussed (by me and George, mainly); the former is a new --small but nasty-- thing I discovered during a couple of heavy tracing sessions. :-)

All the benchmarks have been rerun. No big changes in trends or anything: what held true for v1 still does here (although, honestly, the numbers look even a little bit better).

[3 of 6] xen: sched_credit: use current_on_cpu() when appropriate

Is just (an attempt) to improve code readability.

[4 of 6] xen: tracing: report where a VCPU wakes up

Is just (an attempt) to improve trace readability.

[5 of 6] xen: tracing: introduce per-scheduler trace event IDs
[6 of 6] xen: sched_credit: add some tracing

Finally, are what enables the per-scheduler trace record generation already discussed (again, mostly by me and George), reworked as suggested and requested during review of v1 of this series.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
In _csched_cpu_pick() we try to select the best possible CPU for running a VCPU, considering the characteristics of the underlying hardware (i.e., how many threads, cores and sockets there are, and how busy they are). What we want is "the idle execution vehicle with the most idling neighbours in its grouping". In order to achieve that, we select a CPU from the VCPU's affinity, giving preference to its current processor if possible, as the basis for the comparison with all the other CPUs.

Problem is, to discount the VCPU itself when computing this "idleness" (in an attempt to be fair wrt its current processor), we arbitrarily and unconditionally consider that selected CPU as idle, even when that is not the case. For instance:

 1. the CPU is not the one where the VCPU is running (perhaps due to the affinity being changed);
 2. the CPU is where the VCPU is running, but it has other VCPUs in its runq, so it won't go idle even if the VCPU in question does.

This is exemplified in the trace below:

]  3.466115364 x|------|------| d10v1   22005(2:2:5) 3 [ a  1  8 ]
   ...    ...    ...
   3.466122856 x|------|------| d10v1 runstate_change d10v1 running->offline
   3.466123046 x|------|------| d?v?  runstate_change d32767v0 runnable->running
   ...    ...    ...
]  3.466126887 x|------|------| d32767v0  28004(2:8:4) 3 [ a  1  8 ]

The 22005(...) line (the first line) means _csched_cpu_pick() was called on VCPU 1 of domain 10, while it was running on CPU 0, and it chose CPU 8, which is busy ('|'), even though there are plenty of idle CPUs. That is because, as a consequence of changing the VCPU affinity, CPU 8 was chosen as the basis for the comparison, and therefore considered idle (its bit gets unconditionally set in the bitmask representing the idle CPUs). The 28004(...) line means the VCPU was woken up and queued on CPU 8's runq, where it waits for a context switch or a migration in order to be able to execute.
This change fixes things by only considering the "guessed" CPU idle if the VCPU in question is both running there and is its only runnable VCPU. While at it, rename the two variables (within _csched_cpu_pick()) counting the number of idlers for `cpu' and `nxt' to `nr_idlers_cpu' and `nr_idlers_nxt', which makes their job a little more evident than the current `weight_*' names do.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -59,6 +59,8 @@
 #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
 #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
 #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
+/* Is the first element of _cpu's runq its idle vcpu? */
+#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))


 /*
@@ -479,9 +481,14 @@ static int
      * distinct cores first and guarantees we don't do something stupid
      * like run two VCPUs on co-hyperthreads while there are idle cores
      * or sockets.
+     *
+     * Notice that, when computing the "idleness" of cpu, we may want to
+     * discount vc. That is, iff vc is the currently running and the only
+     * runnable vcpu on cpu, we add cpu to the idlers.
     */
    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    cpumask_set_cpu(cpu, &idlers);
+    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
+        cpumask_set_cpu(cpu, &idlers);
    cpumask_and(&cpus, &cpus, &idlers);
    cpumask_clear_cpu(cpu, &cpus);

@@ -489,7 +496,7 @@ static int
    {
        cpumask_t cpu_idlers;
        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
+        int nxt, nr_idlers_cpu, nr_idlers_nxt;
        int migrate_factor;

        nxt = cpumask_cycle(cpu, &cpus);

@@ -513,12 +520,12 @@ static int
            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
        }

-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
+        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
+        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
        /* smt_power_savings: consolidate work rather than spreading it */
        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
+             nr_idlers_cpu > nr_idlers_nxt :
+             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
        {
            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
            spc = CSCHED_PCPU(nxt);

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
 #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
 #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))

+#define current_on_cpu(_c) \
+    ( (per_cpu(schedule_data, _c).curr) )
+
 #define DOMAIN_DESTROYED (1<<31) /* assumes atomic_t is >= 32 bits */
 #define put_domain(_d) \
   if ( atomic_dec_and_test(&(_d)->refcnt) ) domain_destroy(_d)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs
Right now, when a VCPU wakes up, we check whether it should preempt what is running on the PCPU, and whether or not the waking VCPU can be migrated (by tickling some idlers). However, this can result in suboptimal or even wrong behaviour, as explained here:

http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

This change, instead, when deciding which PCPU(s) to tickle upon VCPU wake-up, considers both what is likely to happen on the PCPU where the wakeup occurs, and whether or not there are idlers where the woken-up VCPU can run. In fact, if there are, we can avoid interrupting the running VCPU. Only in case there aren't any such PCPUs are preemption and migration the way to go.

This has been tested (on top of the previous change) by running the following benchmarks inside 2, 6 and 10 VMs, concurrently, on a shared host, each with 2 VCPUs and 960 MB of memory (the host had 16 ways and 12 GB RAM).

1) All VMs had 'cpus="all"' in their config file.

 $ sysbench --test=cpu ... (time, lower is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 50.078467 +/- 1.6676162  | 49.673667 +/- 0.0094321  |
 |   6 | 63.259472 +/- 0.1137586  | 61.680011 +/- 1.0208723  |
 |  10 | 91.246797 +/- 0.1154008  | 90.396720 +/- 1.5900423  |

 $ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 485.56333 +/- 6.0527356  | 487.83167 +/- 0.7602850  |
 |   6 | 401.36278 +/- 1.9745916  | 409.96778 +/- 3.6761092  |
 |  10 | 294.43933 +/- 0.8064945  | 302.49033 +/- 0.2343978  |

 $ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 43150.63 +/- 1359.5616   | 43275.427 +/- 606.28185  |
 |   6 | 29274.29 +/- 1024.4042   | 29716.189 +/- 1290.1878  |
 |  10 | 19061.28 +/- 512.88561   | 19192.599 +/- 605.66058  |

2) All VMs had their VCPUs statically pinned to the host's PCPUs.

 $ sysbench --test=cpu ...
 (time, lower is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 47.8211 +/- 0.0215504    | 47.826900 +/- 0.0077872  |
 |   6 | 62.689122 +/- 0.0877173  | 62.764539 +/- 0.3882493  |
 |  10 | 90.321097 +/- 1.4803867  | 89.974570 +/- 1.1437566  |

 $ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 550.97667 +/- 2.3512355  | 550.87000 +/- 0.8140792  |
 |   6 | 443.15000 +/- 5.7471797  | 454.01056 +/- 8.4373466  |
 |  10 | 313.89233 +/- 1.3237493  | 321.81167 +/- 0.3528418  |

 $ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 49591.057 +/- 952.93384  | 49594.195 +/- 799.57976  |
 |   6 | 33538.247 +/- 1089.2115  | 33671.758 +/- 1077.6806  |
 |  10 | 21927.870 +/- 831.88742  | 21891.131 +/- 563.37929  |

The numbers show that the change has either no or very limited impact (the specjbb2005 case) or, when it does have some impact, a real improvement in performance (the sysbench-memory case).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * Rewritten as per George's suggestion, in order to improve readability.
 * Killed some of the stats, with the only exception of `tickle_idlers_none' and `tickle_idlers_some'. They don't make things look that terrible, and I think they could be useful.
 * The preemption+migration of the currently running VCPU has been turned into a migration request, instead of just tickling. I traced this thing some more, and it looks like that is the way to go. Tickling is not effective here, because the woken PCPU would expect cur to be out of the scheduler tail, which is likely false (cur->is_running is still set to 1).
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -133,6 +133,7 @@ struct csched_vcpu {
         uint32_t state_idle;
         uint32_t migrate_q;
         uint32_t migrate_r;
+        uint32_t kicked_away;
     } stats;
 #endif
 };
@@ -251,54 +252,67 @@ static inline void
    struct csched_vcpu * const cur =
        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
-    cpumask_t mask;
+    cpumask_t mask, idle_mask;
+    int idlers_empty;

    ASSERT(cur);
    cpumask_clear(&mask);

-    /* If strictly higher priority than current VCPU, signal the CPU */
-    if ( new->pri > cur->pri )
+    idlers_empty = cpumask_empty(prv->idlers);
+
+    /*
+     * If the pcpu is idle, or there are no idlers and the new
+     * vcpu is a higher priority than the old vcpu, run it here.
+     *
+     * If there are idle cpus, first try to find one suitable to run
+     * new, so we can avoid preempting cur. If we cannot find a
+     * suitable idler on which to run new, run it here, but try to
+     * find a suitable idler on which to run cur instead.
+     */
+    if ( cur->pri == CSCHED_PRI_IDLE
+         || (idlers_empty && new->pri > cur->pri) )
    {
-        if ( cur->pri == CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_local_idler);
-        else if ( cur->pri == CSCHED_PRI_TS_OVER )
-            SCHED_STAT_CRANK(tickle_local_over);
-        else if ( cur->pri == CSCHED_PRI_TS_UNDER )
-            SCHED_STAT_CRANK(tickle_local_under);
-        else
-            SCHED_STAT_CRANK(tickle_local_other);
-
+        if ( cur->pri != CSCHED_PRI_IDLE )
+            SCHED_STAT_CRANK(tickle_idlers_none);
        cpumask_set_cpu(cpu, &mask);
    }
+    else if ( !idlers_empty )
+    {
+        /* Check whether or not there are idlers that can run new */
+        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);

-    /*
-     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
-     * let them know there is runnable work in the system...
-     */
-    if ( cur->pri > CSCHED_PRI_IDLE )
-    {
-        if ( cpumask_empty(prv->idlers) )
+        /*
+         * If there are no suitable idlers for new, and it's higher
+         * priority than cur, ask the scheduler to migrate cur away.
+         * We have to act like this (instead of just waking some of
+         * the idlers suitable for cur) because cur is running.
+         *
+         * If there are suitable idlers for new, no matter priorities,
+         * leave cur alone (as it is running and is, likely, cache-hot)
+         * and wake some of them (which is waking up and so is, likely,
+         * cache cold anyway).
+         */
+        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
        {
            SCHED_STAT_CRANK(tickle_idlers_none);
+            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+            SCHED_STAT_CRANK(migrate_kicked_away);
+            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            cpumask_set_cpu(cpu, &mask);
        }
-        else
+        else if ( !cpumask_empty(&idle_mask) )
        {
-            cpumask_t idle_mask;
-
-            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            /* Which of the idlers suitable for new shall we wake up? */
+            SCHED_STAT_CRANK(tickle_idlers_some);
+            if ( opt_tickle_one_idle )
            {
-                SCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) =
-                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                this_cpu(last_tickle_cpu) =
+                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
            }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
+            else
+                cpumask_or(&mask, &mask, &idle_mask);
        }
    }
@@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc
    {
        printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
 #ifdef CSCHED_STATS
-        printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
+        printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}",
               svc->stats.credit_last,
               svc->stats.credit_incr,
               svc->stats.state_active,
               svc->stats.state_idle,
               svc->stats.migrate_q,
-              svc->stats.migrate_r);
+              svc->stats.migrate_r,
+              svc->stats.kicked_away);
 #endif
    }

diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -39,10 +39,6 @@
 PERFCOUNTER(vcpu_wake_runnable,     "csched: vcpu_wake_runnable")
 PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable")
 PERFCOUNTER(vcpu_park,              "csched: vcpu_park")
 PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
-PERFCOUNTER(tickle_local_idler,     "csched: tickle_local_idler")
-PERFCOUNTER(tickle_local_over,      "csched: tickle_local_over")
-PERFCOUNTER(tickle_local_under,     "csched: tickle_local_under")
-PERFCOUNTER(tickle_local_other,     "csched: tickle_local_other")
 PERFCOUNTER(tickle_idlers_none,     "csched: tickle_idlers_none")
 PERFCOUNTER(tickle_idlers_some,     "csched: tickle_idlers_some")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
@@ -52,6 +48,7 @@
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")
 PERFCOUNTER(migrate_running,        "csched: migrate_running")
+PERFCOUNTER(migrate_kicked_away,    "csched: migrate_kicked_away")
 PERFCOUNTER(vcpu_hot,               "csched: vcpu_hot")

 PERFCOUNTER(need_flush_tlb_flush,   "PG_need_flush tlb flushes")
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
Defined by the previous change, using it wherever appropriate throughout sched_credit.c makes the code easier to read.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -231,7 +231,7 @@ static void burn_credits(struct csched_v
    unsigned int credits;

    /* Assert svc is current */
-    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+    ASSERT( svc == CSCHED_VCPU(current_on_cpu(svc->vcpu->processor)) );

    if ( (delta = now - svc->start_time) <= 0 )
        return;
@@ -249,8 +249,7 @@ DEFINE_PER_CPU(unsigned int, last_tickle
 static inline void
 __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 {
-    struct csched_vcpu * const cur =
-        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    struct csched_vcpu * const cur = CSCHED_VCPU(current_on_cpu(cpu));
    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
    cpumask_t mask, idle_mask;
    int idlers_empty;
@@ -387,7 +386,7 @@ csched_alloc_pdata(const struct schedule
    per_cpu(schedule_data, cpu).sched_priv = spc;

    /* Start off idling... */
-    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+    BUG_ON(!is_idle_vcpu(current_on_cpu(cpu)));
    cpumask_set_cpu(cpu, prv->idlers);

    spin_unlock_irqrestore(&prv->lock, flags);
@@ -730,7 +729,7 @@ csched_vcpu_sleep(const struct scheduler

    BUG_ON( is_idle_vcpu(vc) );

-    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+    if ( current_on_cpu(vc->processor) == vc )
        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);
@@ -744,7 +743,7 @@ csched_vcpu_wake(const struct scheduler

    BUG_ON( is_idle_vcpu(vc) );

-    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    if ( unlikely(current_on_cpu(cpu) == vc) )
    {
        SCHED_STAT_CRANK(vcpu_wake_running);
        return;
@@ -1213,7 +1212,7 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri)
 {
    const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = per_cpu(schedule_data, peer_cpu).curr;
+    const struct vcpu * const peer_vcpu = current_on_cpu(peer_cpu);
    struct csched_vcpu *speer;
    struct list_head *iter;
    struct vcpu *vc;
@@ -1502,7 +1501,7 @@ csched_dump_pcpu(const struct scheduler
    printk("core=%s\n", cpustr);

    /* current VCPU */
-    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    svc = CSCHED_VCPU(current_on_cpu(cpu));
    if ( svc )
    {
        printk("\trun: ");
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
When looking at traces, it turns out to be useful to know where a waking-up VCPU is being queued. Yes, that is always the CPU where it last ran, but that information can well be lost in past trace records!

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -371,7 +371,7 @@ void vcpu_wake(struct vcpu *v)

    vcpu_schedule_unlock_irqrestore(v, flags);

-    TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);
+    TRACE_3D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id, v->processor);
 }

 void vcpu_unblock(struct vcpu *v)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 5 of 6 v2] xen: tracing: introduce per-scheduler trace event IDs
So that it becomes possible to create scheduler-specific trace records, within each scheduler, without worrying about overlaps, and also without giving up the ability to recognise them unambiguously. The latter is deemed useful, since we can have more than one scheduler running at the same time, thanks to cpupools.

The event ID is 12 bits, and this change uses the upper 3 of them for the 'scheduler ID'. This means we are limited to 8 schedulers and to 512 scheduler-specific tracing events. Both seem reasonable limitations as of now.

This also converts the existing credit2 tracing (the only scheduler generating tracing events up to now) to the new system.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * The event ID generation macro is now called `TRC_SCHED_CLASS_EVT()', and has been generalized and put in trace.h, as suggested.
 * The handling of per-scheduler tracing IDs and masks has been restructured, properly naming "ID" the numerical identifiers and "MASK" the bitmasks, as requested.

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -29,18 +29,22 @@
 #define d2printk(x...)
 //#define d2printk printk

-#define TRC_CSCHED2_TICK             TRC_SCHED_CLASS + 1
-#define TRC_CSCHED2_RUNQ_POS         TRC_SCHED_CLASS + 2
-#define TRC_CSCHED2_CREDIT_BURN      TRC_SCHED_CLASS + 3
-#define TRC_CSCHED2_CREDIT_ADD       TRC_SCHED_CLASS + 4
-#define TRC_CSCHED2_TICKLE_CHECK     TRC_SCHED_CLASS + 5
-#define TRC_CSCHED2_TICKLE           TRC_SCHED_CLASS + 6
-#define TRC_CSCHED2_CREDIT_RESET     TRC_SCHED_CLASS + 7
-#define TRC_CSCHED2_SCHED_TASKLET    TRC_SCHED_CLASS + 8
-#define TRC_CSCHED2_UPDATE_LOAD      TRC_SCHED_CLASS + 9
-#define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS + 10
-#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS + 11
-#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS + 12
+/*
+ * Credit2 tracing events ("only" 512 available!). Check
+ * include/public/trace.h for more details.
+ */
+#define TRC_CSCHED2_TICK             TRC_SCHED_CLASS_EVT(CSCHED2, 1)
+#define TRC_CSCHED2_RUNQ_POS         TRC_SCHED_CLASS_EVT(CSCHED2, 2)
+#define TRC_CSCHED2_CREDIT_BURN      TRC_SCHED_CLASS_EVT(CSCHED2, 3)
+#define TRC_CSCHED2_CREDIT_ADD       TRC_SCHED_CLASS_EVT(CSCHED2, 4)
+#define TRC_CSCHED2_TICKLE_CHECK     TRC_SCHED_CLASS_EVT(CSCHED2, 5)
+#define TRC_CSCHED2_TICKLE           TRC_SCHED_CLASS_EVT(CSCHED2, 6)
+#define TRC_CSCHED2_CREDIT_RESET     TRC_SCHED_CLASS_EVT(CSCHED2, 7)
+#define TRC_CSCHED2_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(CSCHED2, 8)
+#define TRC_CSCHED2_UPDATE_LOAD      TRC_SCHED_CLASS_EVT(CSCHED2, 9)
+#define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS_EVT(CSCHED2, 10)
+#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11)
+#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12)

 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the

diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -57,6 +57,32 @@
 #define TRC_SCHED_CLASS   0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE 0x00028000   /* More inclusive scheduling */

+/*
+ * The highest 3 bits of the last 12 bits of TRC_SCHED_CLASS above are
+ * reserved for encoding what scheduler produced the information. The
+ * actual event is encoded in the last 9 bits.
+ *
+ * This means we have 8 scheduling IDs available (which means at most 8
+ * schedulers generating events) and, in each scheduler, up to 512
+ * different events.
+ */
+#define TRC_SCHED_ID_BITS  3
+#define TRC_SCHED_ID_SHIFT (TRC_SUBCLS_SHIFT - TRC_SCHED_ID_BITS)
+#define TRC_SCHED_ID_MASK  (((1UL<<TRC_SCHED_ID_BITS) - 1) << TRC_SCHED_ID_SHIFT)
+#define TRC_SCHED_EVT_MASK (~(TRC_SCHED_ID_MASK))
+
+/* Per-scheduler IDs, to identify scheduler specific events */
+#define TRC_SCHED_CSCHED   0
+#define TRC_SCHED_CSCHED2  1
+#define TRC_SCHED_SEDF     2
+#define TRC_SCHED_ARINC653 3
+
+/* Per-scheduler tracing */
+#define TRC_SCHED_CLASS_EVT(_c, _e) \
+    ( ( TRC_SCHED_CLASS | \
+        ((TRC_SCHED_##_c << TRC_SCHED_ID_SHIFT) & TRC_SCHED_ID_MASK) ) + \
+      (_e & TRC_SCHED_EVT_MASK) )
+
 /* Trace classes for Hardware */
 #define TRC_HW_PM           0x00801000   /* Power management traces */
 #define TRC_HW_IRQ          0x00802000   /* Traces relating to the handling of IRQs */
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 6 of 6 v2] xen: sched_credit: add some tracing
About tickling, and PCPU selection.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * Dummy `struct d {}', accommodating `cpu' only, removed in favour of the much more readable `trace_var(..., sizeof(cpu), &cpu)', as suggested.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -21,6 +21,7 @@
 #include <asm/atomic.h>
 #include <xen/errno.h>
 #include <xen/keyhandler.h>
+#include <xen/trace.h>

 /*
@@ -97,6 +98,18 @@

 /*
+ * Credit tracing events ("only" 512 available!). Check
+ * include/public/trace.h for more details.
+ */
+#define TRC_CSCHED_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED, 1)
+#define TRC_CSCHED_ACCOUNT_START TRC_SCHED_CLASS_EVT(CSCHED, 2)
+#define TRC_CSCHED_ACCOUNT_STOP  TRC_SCHED_CLASS_EVT(CSCHED, 3)
+#define TRC_CSCHED_STOLEN_VCPU   TRC_SCHED_CLASS_EVT(CSCHED, 4)
+#define TRC_CSCHED_PICKED_CPU    TRC_SCHED_CLASS_EVT(CSCHED, 5)
+#define TRC_CSCHED_TICKLE        TRC_SCHED_CLASS_EVT(CSCHED, 6)
+
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -315,9 +328,18 @@ static inline void
        }
    }

-    /* Send scheduler interrupts to designated CPUs */
    if ( !cpumask_empty(&mask) )
+    {
+        if ( unlikely(tb_init_done) )
+        {
+            /* Avoid TRACE_*: saves checking !tb_init_done each step */
+            for_each_cpu(cpu, &mask)
+                trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu);
+        }
+
+        /* Send scheduler interrupts to designated CPUs */
        cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
+    }
 }

 static void
@@ -554,6 +576,8 @@ static int
    if ( commit && spc )
        spc->idle_bias = cpu;

+    TRACE_3D(TRC_CSCHED_PICKED_CPU, vc->domain->domain_id, vc->vcpu_id, cpu);
+
    return cpu;
 }
@@ -586,6 +610,9 @@ static inline void
        }
    }

+    TRACE_3D(TRC_CSCHED_ACCOUNT_START, sdom->dom->domain_id,
+             svc->vcpu->vcpu_id, sdom->active_vcpu_count);
+
    spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -608,6 +635,9 @@ static inline void
    {
        list_del_init(&sdom->active_sdom_elem);
    }
+
+    TRACE_3D(TRC_CSCHED_ACCOUNT_STOP, sdom->dom->domain_id,
+             svc->vcpu->vcpu_id, sdom->active_vcpu_count);
 }

 static void
@@ -1241,6 +1271,8 @@ csched_runq_steal(int peer_cpu, int cpu,
        if (__csched_vcpu_is_migrateable(vc, cpu))
        {
            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
            SCHED_STAT_CRANK(migrate_queued);
            WARN_ON(vc->is_urgent);
@@ -1401,6 +1433,7 @@ csched_schedule(
    /* Tasklet work (which runs in idle VCPU context) overrides all else. */
    if ( tasklet_work_scheduled )
    {
+        TRACE_0D(TRC_CSCHED_SCHED_TASKLET);
        snext = CSCHED_VCPU(idle_vcpu[cpu]);
        snext->pri = CSCHED_PRI_TS_BOOST;
    }
Jan Beulich
2012-Dec-12 10:04 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -59,6 +59,8 @@
>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
> +/* Is the first element of _cpu's runq its idle vcpu? */
> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>
>  /*
> @@ -479,9 +481,14 @@ static int
>       * distinct cores first and guarantees we don't do something stupid
>       * like run two VCPUs on co-hyperthreads while there are idle cores
>       * or sockets.
> +     *
> +     * Notice that, when computing the "idleness" of cpu, we may want to
> +     * discount vc. That is, iff vc is the currently running and the only
> +     * runnable vcpu on cpu, we add cpu to the idlers.
>       */
>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    cpumask_set_cpu(cpu, &idlers);
> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
> +        cpumask_set_cpu(cpu, &idlers);
>      cpumask_and(&cpus, &cpus, &idlers);
>      cpumask_clear_cpu(cpu, &cpus);
>
> @@ -489,7 +496,7 @@ static int
>      {
>          cpumask_t cpu_idlers;
>          cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;
>          int migrate_factor;
>
>          nxt = cpumask_cycle(cpu, &cpus);
>
> @@ -513,12 +520,12 @@ static int
>              cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
>          }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
> +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
>          /* smt_power_savings: consolidate work rather than spreading it */
>          if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> +             nr_idlers_cpu > nr_idlers_nxt :
> +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>          {
>              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>              spc = CSCHED_PCPU(nxt);

Despite you mentioning this in the description, these last two hunks are, afaict, only renaming variables (and even that is debatable, as the current names aren't really misleading imo), and hence I don't think they belong in a patch that clearly has the potential for causing (performance) regressions. That said - I don't think it will (and, even more, I'm agreeable to the change done).

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>
> +#define current_on_cpu(_c) \
> +    ( (per_cpu(schedule_data, _c).curr) )
> +

This, imo, really belongs in sched-if.h. Plus - what's the point of the double parentheses, when in fact none at all would be needed? And finally, why "_c" and not just "c"?

Jan

> #define DOMAIN_DESTROYED (1<<31) /* assumes atomic_t is >= 32 bits */
> #define put_domain(_d) \
>   if ( atomic_dec_and_test(&(_d)->refcnt) ) domain_destroy(_d)
Dario Faggioli
2012-Dec-12 10:19 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Wed, 2012-12-12 at 10:04 +0000, Jan Beulich wrote:
> > -        weight_cpu = cpumask_weight(&cpu_idlers);
> > -        weight_nxt = cpumask_weight(&nxt_idlers);
> > +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
> > +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
> >          /* smt_power_savings: consolidate work rather than spreading it */
> >          if ( sched_smt_power_savings ?
> > -             weight_cpu > weight_nxt :
> > -             weight_cpu * migrate_factor < weight_nxt )
> > +             nr_idlers_cpu > nr_idlers_nxt :
> > +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
> >          {
> >              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> >              spc = CSCHED_PCPU(nxt);
>
> Despite you mentioning this in the description, these last two hunks
> are, afaict, only renaming variables (and that's even debatable, as
> the current names aren't really misleading imo), and hence I don't
> think belong in a patch that clearly has the potential for causing
> (performance) regressions.
>
Ok, I think I can live with the current names too... Just a matter of taste. :-)

> That said - I don't think it will (and even more, I'm agreeable to the
> change done).
>
It has been benchmarked, together with the next change, and the results are in the changelog of 2/6. The numbers there show that the combination of those two changes is much more of an improvement than anything else, at least for the workloads I considered (which include sysbench and specjbb2005).

Anyway, I think I see your point, and I can either move the rename somewhere else or kill it entirely.

> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
> >  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
> >  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
> >
> > +#define current_on_cpu(_c) \
> > +    ( (per_cpu(schedule_data, _c).curr) )
> > +
>
> This, imo, really belongs in sched-if.h.
>
Ok.

> Plus - what's the point of double parentheses, when in fact none
> at all would be needed?
>
> And finally, why "_c" and not just "c"?
>
Nothing particular, just "personal macro style", I guess, which I can convert to what you ask and resend.

Thanks,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2012-Dec-12 10:30 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 12.12.12 at 11:19, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> On Wed, 2012-12-12 at 10:04 +0000, Jan Beulich wrote:
>> Despite you mentioning this in the description, these last two hunks
>> are, afaict, only renaming variables (and that's even debatable, as
>> the current names aren't really misleading imo), and hence I don't
>> think belong in a patch that clearly has the potential for causing
>> (performance) regressions.
>>
> Ok, I think I can live with the current names too... Just a matter of
> taste. :-)
>
>> That said - I don't think it will (and even more, I'm agreeable to the
>> change done).
>>
> It has been benchmarked, together with the next change, and the results
> are in the changelog of 2/6. Numbers there show that the combination of
> those two changes is much more an improvement than anything else, at
> least for the workloads I considered (which include sysbench and
> specjbb2005).
>
> Anyway, I think I see your point, and I can either move the rename
> somewhere else or kill it entirely.

Yes please; I'll leave it to George to decide upon an eventual
separate renaming patch.

Btw., when you resend, can you please also fix the subject, so
grepping the changeset titles for "idle" would actually hit on this
change?

Thanks, Jan
Dario Faggioli
2012-Dec-12 10:38 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Wed, 2012-12-12 at 10:30 +0000, Jan Beulich wrote:
> > Anyway, I think I see your point, and I can either move the rename
> > somewhere else or kill it entirely.
>
> Yes please; I'll leave it to George to decide upon an eventual
> separate renaming patch.
>
Ok.

> Btw., when you resend, can you please also fix the subject, so
> grepping the changeset titles for "idle" would actually hit on this
> change?
>
Oops! My bad, sorry for that. I sure will.

Dario
George Dunlap
2012-Dec-14 19:16 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On 12/12/12 02:52, Dario Faggioli wrote:
> In _csched_cpu_pick() we try to select the best possible CPU for
> running a VCPU, considering the characteristics of the underlying
> hardware (i.e., how many threads, cores, sockets, and how busy they
> are). What we want is "the idle execution vehicle with the most
> idling neighbours in its grouping".
>
> In order to achieve it, we select a CPU from the VCPU's affinity,
> giving preference to its current processor if possible, as the basis
> for the comparison with all the other CPUs. Problem is, to discount
> the VCPU itself when computing this "idleness" (in an attempt to be
> fair wrt its current processor), we arbitrarily and unconditionally
> consider that selected CPU as idle, even when it is not the case,
> for instance:
>  1. If the CPU is not the one where the VCPU is running (perhaps due
>     to the affinity being changed);
>  2. The CPU is where the VCPU is running, but it has other VCPUs in
>     its runq, so it won't go idle even if the VCPU in question goes.

Good catch -- thanks.  Comments below.

> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -59,6 +59,8 @@
>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
> +/* Is the first element of _cpu's runq its idle vcpu? */
> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>
>
>  /*
> @@ -479,9 +481,14 @@ static int
>       * distinct cores first and guarantees we don't do something stupid
>       * like run two VCPUs on co-hyperthreads while there are idle cores
>       * or sockets.
> +     *
> +     * Notice that, when computing the "idleness" of cpu, we may want to
> +     * discount vc. That is, iff vc is the currently running and the only
> +     * runnable vcpu on cpu, we add cpu to the idlers.
>      */
>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    cpumask_set_cpu(cpu, &idlers);
> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
> +        cpumask_set_cpu(cpu, &idlers);

Why bother with this whole "current_on_cpu()" thing, when you can just
look at vc->processor?  I.e.:

if ( cpu == vc->processor && IS_RUNQ_IDLE(cpu) )

>      cpumask_and(&cpus, &cpus, &idlers);
>      cpumask_clear_cpu(cpu, &cpus);
>
> @@ -489,7 +496,7 @@ static int
>      {
>          cpumask_t cpu_idlers;
>          cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;

I think Jan is right, this probably should be a separate patch.

 -George
George Dunlap
2012-Dec-14 19:29 UTC
Re: [PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs
On 12/12/12 02:52, Dario Faggioli wrote:
> Right now, when a VCPU wakes-up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPU(s) to tickle, upon
> VCPU wake-up, considers both what is likely to happen on the PCPU
> where the wakeup occurs, and whether or not there are idlers where
> the woken-up VCPU can run. In fact, if there are, we can avoid
> interrupting the running VCPU. Only in case there aren't any such
> PCPUs are preemption and migration the way to go.
>
> This has been tested (on top of the previous change) by running
> the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
> a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
> ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
> |   6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
> |  10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
> |   6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
> |  10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
> | VMs | w/o this change        | w/ this change          |
> |   2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 |
> |   6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 |
> |  10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 |
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
> |   6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
> |  10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
> |   6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
> |  10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
> |   2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
> |   6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
> |  10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |
>
> Numbers show that the change has either no or very limited impact
> (specjbb2005 case) or, when it does have some impact, that it is a
> real improvement in performance (sysbench-memory case).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
>  * Rewritten as per George's suggestion, in order to improve readability.
>  * Killed some of the stats, with the only exception of `tickle_idlers_none'
>    and `tickle_idlers_some'. They don't make things look that terrible and
>    I think they could be useful.
>  * The preemption+migration of the currently running VCPU has been turned
>    into a migration request, instead of just tickling. I traced this
>    thing some more, and it looks like that is the way to go. Tickling is
>    not effective here, because the woken PCPU would expect cur to be out
>    of the scheduler tail, which is likely false (cur->is_running is
>    still set to 1).

Ah, right.
:-) Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -133,6 +133,7 @@ struct csched_vcpu { > uint32_t state_idle; > uint32_t migrate_q; > uint32_t migrate_r; > + uint32_t kicked_away; > } stats; > #endif > }; > @@ -251,54 +252,67 @@ static inline void > struct csched_vcpu * const cur > CSCHED_VCPU(per_cpu(schedule_data, cpu).curr); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > - cpumask_t mask; > + cpumask_t mask, idle_mask; > + int idlers_empty; > > ASSERT(cur); > cpumask_clear(&mask); > > - /* If strictly higher priority than current VCPU, signal the CPU */ > - if ( new->pri > cur->pri ) > + idlers_empty = cpumask_empty(prv->idlers); > + > + /* > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. > + * > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. 
> + */ > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty && new->pri > cur->pri) ) > { > - if ( cur->pri == CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_local_idler); > - else if ( cur->pri == CSCHED_PRI_TS_OVER ) > - SCHED_STAT_CRANK(tickle_local_over); > - else if ( cur->pri == CSCHED_PRI_TS_UNDER ) > - SCHED_STAT_CRANK(tickle_local_under); > - else > - SCHED_STAT_CRANK(tickle_local_other); > - > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu, &mask); > } > + else if ( !idlers_empty ) > + { > + /* Check whether or not there are idlers that can run new */ > + cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > > - /* > - * If this CPU has at least two runnable VCPUs, we tickle any idlers to > - * let them know there is runnable work in the system... > - */ > - if ( cur->pri > CSCHED_PRI_IDLE ) > - { > - if ( cpumask_empty(prv->idlers) ) > + /* > + * If there are no suitable idlers for new, and it''s higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). > + */ > + if ( cpumask_empty(&idle_mask) && new->pri > cur->pri ) > { > SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating, &cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu, &mask); > } > - else > + else if ( !cpumask_empty(&idle_mask) ) > { > - cpumask_t idle_mask; > - > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > - if ( !cpumask_empty(&idle_mask) ) > + /* Which of the idlers suitable for new shall we wake up? 
*/ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > { > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > - { > - this_cpu(last_tickle_cpu) > - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > - } > - else > - cpumask_or(&mask, &mask, &idle_mask); > + this_cpu(last_tickle_cpu) > + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > } > - cpumask_and(&mask, &mask, new->vcpu->cpu_affinity); > + else > + cpumask_or(&mask, &mask, &idle_mask); > } > } > > @@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc > { > printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight); > #ifdef CSCHED_STATS > - printk(" (%d+%u) {a/i=%u/%u m=%u+%u}", > + printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}", > svc->stats.credit_last, > svc->stats.credit_incr, > svc->stats.state_active, > svc->stats.state_idle, > svc->stats.migrate_q, > - svc->stats.migrate_r); > + svc->stats.migrate_r, > + svc->stats.kicked_away); > #endif > } > > diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h > --- a/xen/include/xen/perfc_defn.h > +++ b/xen/include/xen/perfc_defn.h > @@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable, "csc > PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable") > PERFCOUNTER(vcpu_park, "csched: vcpu_park") > PERFCOUNTER(vcpu_unpark, "csched: vcpu_unpark") > -PERFCOUNTER(tickle_local_idler, "csched: tickle_local_idler") > -PERFCOUNTER(tickle_local_over, "csched: tickle_local_over") > -PERFCOUNTER(tickle_local_under, "csched: tickle_local_under") > -PERFCOUNTER(tickle_local_other, "csched: tickle_local_other") > PERFCOUNTER(tickle_idlers_none, "csched: tickle_idlers_none") > PERFCOUNTER(tickle_idlers_some, "csched: tickle_idlers_some") > PERFCOUNTER(load_balance_idle, "csched: load_balance_idle") > @@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed, "csc > 
PERFCOUNTER(steal_peer_idle, "csched: steal_peer_idle") > PERFCOUNTER(migrate_queued, "csched: migrate_queued") > PERFCOUNTER(migrate_running, "csched: migrate_running") > +PERFCOUNTER(migrate_kicked_away, "csched: migrate_kicked_away") > PERFCOUNTER(vcpu_hot, "csched: vcpu_hot") > > PERFCOUNTER(need_flush_tlb_flush, "PG_need_flush tlb flushes")
George Dunlap
2012-Dec-14 19:39 UTC
Re: [PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
On 12/12/12 02:52, Dario Faggioli wrote:
> Defined by the previous change, using it wherever it is appropriate
> throughout sched_credit.c makes the code easier to read.

Hmm, I hadn't read this patch when I commented about removing the macro
from the first patch. :-)  I personally think that using vc->processor
is better in that patch anyway; but using this macro elsewhere is
probably fine.

I think from a taste point of view, I would have put this patch, with
the new definition, as the first patch in the series, and then had the
second patch just use it.

I'll go back and respond to Jan's comment about the macro.

 -George

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -231,7 +231,7 @@ static void burn_credits(struct csched_v
>      unsigned int credits;
>
>      /* Assert svc is current */
> -    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
> +    ASSERT( svc == CSCHED_VCPU(current_on_cpu(svc->vcpu->processor)) );
>
>      if ( (delta = now - svc->start_time) <= 0 )
>          return;
> @@ -249,8 +249,7 @@ DEFINE_PER_CPU(unsigned int, last_tickle
>  static inline void
>  __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>  {
> -    struct csched_vcpu * const cur =
> -        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
> +    struct csched_vcpu * const cur = CSCHED_VCPU(current_on_cpu(cpu));
>      struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>      cpumask_t mask, idle_mask;
>      int idlers_empty;
> @@ -387,7 +386,7 @@ csched_alloc_pdata(const struct schedule
>      per_cpu(schedule_data, cpu).sched_priv = spc;
>
>      /* Start off idling... */
> -    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
> +    BUG_ON(!is_idle_vcpu(current_on_cpu(cpu)));
>      cpumask_set_cpu(cpu, prv->idlers);
>
>      spin_unlock_irqrestore(&prv->lock, flags);
> @@ -730,7 +729,7 @@ csched_vcpu_sleep(const struct scheduler
>
>      BUG_ON( is_idle_vcpu(vc) );
>
> -    if ( per_cpu(schedule_data, vc->processor).curr == vc )
> +    if ( current_on_cpu(vc->processor) == vc )
>          cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
>      else if ( __vcpu_on_runq(svc) )
>          __runq_remove(svc);
> @@ -744,7 +743,7 @@ csched_vcpu_wake(const struct scheduler
>
>      BUG_ON( is_idle_vcpu(vc) );
>
> -    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
> +    if ( unlikely(current_on_cpu(cpu) == vc) )
>      {
>          SCHED_STAT_CRANK(vcpu_wake_running);
>          return;
> @@ -1213,7 +1212,7 @@ static struct csched_vcpu *
>  csched_runq_steal(int peer_cpu, int cpu, int pri)
>  {
>      const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> -    const struct vcpu * const peer_vcpu = per_cpu(schedule_data, peer_cpu).curr;
> +    const struct vcpu * const peer_vcpu = current_on_cpu(peer_cpu);
>      struct csched_vcpu *speer;
>      struct list_head *iter;
>      struct vcpu *vc;
> @@ -1502,7 +1501,7 @@ csched_dump_pcpu(const struct scheduler
>      printk("core=%s\n", cpustr);
>
>      /* current VCPU */
> -    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
> +    svc = CSCHED_VCPU(current_on_cpu(cpu));
>      if ( svc )
>      {
>          printk("\trun: ");
George Dunlap
2012-Dec-14 19:50 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On 12/12/12 10:04, Jan Beulich wrote:>>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> --- a/xen/common/sched_credit.c >> +++ b/xen/common/sched_credit.c >> @@ -59,6 +59,8 @@ >> #define CSCHED_VCPU(_vcpu) ((struct csched_vcpu *) (_vcpu)->sched_priv) >> #define CSCHED_DOM(_dom) ((struct csched_dom *) (_dom)->sched_priv) >> #define RUNQ(_cpu) (&(CSCHED_PCPU(_cpu)->runq)) >> +/* Is the first element of _cpu''s runq its idle vcpu? */ >> +#define IS_RUNQ_IDLE(_cpu) (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu)) >> >> >> /* >> @@ -479,9 +481,14 @@ static int >> * distinct cores first and guarantees we don''t do something stupid >> * like run two VCPUs on co-hyperthreads while there are idle cores >> * or sockets. >> + * >> + * Notice that, when computing the "idleness" of cpu, we may want to >> + * discount vc. That is, iff vc is the currently running and the only >> + * runnable vcpu on cpu, we add cpu to the idlers. >> */ >> cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); >> - cpumask_set_cpu(cpu, &idlers); >> + if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) ) >> + cpumask_set_cpu(cpu, &idlers); >> cpumask_and(&cpus, &cpus, &idlers); >> cpumask_clear_cpu(cpu, &cpus); >> >> @@ -489,7 +496,7 @@ static int >> { >> cpumask_t cpu_idlers; >> cpumask_t nxt_idlers; >> - int nxt, weight_cpu, weight_nxt; >> + int nxt, nr_idlers_cpu, nr_idlers_nxt; >> int migrate_factor; >> >> nxt = cpumask_cycle(cpu, &cpus); >> @@ -513,12 +520,12 @@ static int >> cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); >> } >> >> - weight_cpu = cpumask_weight(&cpu_idlers); >> - weight_nxt = cpumask_weight(&nxt_idlers); >> + nr_idlers_cpu = cpumask_weight(&cpu_idlers); >> + nr_idlers_nxt = cpumask_weight(&nxt_idlers); >> /* smt_power_savings: consolidate work rather than spreading it */ >> if ( sched_smt_power_savings ? 
>> -         weight_cpu > weight_nxt :
>> -         weight_cpu * migrate_factor < weight_nxt )
>> +         nr_idlers_cpu > nr_idlers_nxt :
>> +         nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>>      {
>>          cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>>          spc = CSCHED_PCPU(nxt);
> Despite you mentioning this in the description, these last two hunks
> are, afaict, only renaming variables (and that's even debatable, as
> the current names aren't really misleading imo), and hence I don't
> think belong in a patch that clearly has the potential for causing
> (performance) regressions.
>
> That said - I don't think it will (and even more, I'm agreeable to the
> change done).
>
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>>
>> +#define current_on_cpu(_c) \
>> +    ( (per_cpu(schedule_data, _c).curr) )
>> +
> This, imo, really belongs into sched-if.h.

Hmm, it looks like there are a number of things that could live in
either sched-if.h or sched.h; but I think this one probably most closely
links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
which are in sched.h; so sched.h is where I'd put it.

> Plus - what's the point of double parentheses, when in fact none
> at all would be needed?
>
> And finally, why "_c" and not just "c"?

I think the underscore is pretty standard in macros.  There's certainly
no need for double parentheses though.

 -George
George Dunlap
2012-Dec-14 19:57 UTC
Re: [PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
On 12/12/12 02:52, Dario Faggioli wrote:
> When looking at traces, it turns out to be useful to know where a
> waking-up VCPU is being queued. Yes, that is always the CPU where
> it ran last, but that information can well be lost in past trace
> records!

When you say "lost in past trace records", do you primarily mean that
the records themselves have been lost (due to the per-cpu trace buffers
filling up), or do you mean that it may be way way back and you don't
want to go back and find it?

If the latter, I think the best thing to do would be to just augment
xenalyze to keep track of that information and print it when it sees
the wake record.

 -George
George Dunlap
2012-Dec-14 20:00 UTC
Re: [PATCH 5 of 6 v2] xen: tracing: introduce per-scheduler trace event IDs
On 12/12/12 02:52, Dario Faggioli wrote:
> So that it becomes possible to create scheduler specific trace records,
> within each scheduler, without worrying about overlapping, and also
> without giving up being able to recognise them univocally. The latter
> is deemed as useful, since we can have more than one scheduler running
> at the same time, thanks to cpupools.
>
> The event ID is 12 bits, and this change uses the upper 3 of them for
> the 'scheduler ID'. This means we're limited to 8 schedulers and to
> 512 scheduler specific tracing events. Both seem reasonable limitations
> as of now.
>
> This also converts the existing credit2 tracing (the only scheduler
> generating tracing events up to now) to the new system.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
> Changes from v1:
>  * The event ID generation macro is now called `TRC_SCHED_CLASS_EVT()',
>    and has been generalized and put in trace.h, as suggested.
>  * The handling of per-scheduler tracing IDs and masks has been
>    restructured, properly naming "ID" the numerical identifiers
>    and "MASK" the bitmasks, as requested.
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -29,18 +29,22 @@
>  #define d2printk(x...)
> //#define d2printk printk > > -#define TRC_CSCHED2_TICK TRC_SCHED_CLASS + 1 > -#define TRC_CSCHED2_RUNQ_POS TRC_SCHED_CLASS + 2 > -#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS + 3 > -#define TRC_CSCHED2_CREDIT_ADD TRC_SCHED_CLASS + 4 > -#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS + 5 > -#define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS + 6 > -#define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS + 7 > -#define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS + 8 > -#define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS + 9 > -#define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS + 10 > -#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS + 11 > -#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS + 12 > +/* > + * Credit2 tracing events ("only" 512 available!). Check > + * include/public/trace.h for more details. > + */ > +#define TRC_CSCHED2_TICK TRC_SCHED_CLASS_EVT(CSCHED2, 1) > +#define TRC_CSCHED2_RUNQ_POS TRC_SCHED_CLASS_EVT(CSCHED2, 2) > +#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS_EVT(CSCHED2, 3) > +#define TRC_CSCHED2_CREDIT_ADD TRC_SCHED_CLASS_EVT(CSCHED2, 4) > +#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS_EVT(CSCHED2, 5) > +#define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS_EVT(CSCHED2, 6) > +#define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS_EVT(CSCHED2, 7) > +#define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED2, 8) > +#define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 9) > +#define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS_EVT(CSCHED2, 10) > +#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11) > +#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12) > > /* > * WARNING: This is still in an experimental phase. 
Status and work can be found at the > diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h > --- a/xen/include/public/trace.h > +++ b/xen/include/public/trace.h > @@ -57,6 +57,32 @@ > #define TRC_SCHED_CLASS 0x00022000 /* Scheduler-specific */ > #define TRC_SCHED_VERBOSE 0x00028000 /* More inclusive scheduling */ > > +/* > + * The highest 3 bits of the last 12 bits of TRC_SCHED_CLASS above are > + * reserved for encoding what scheduler produced the information. The > + * actual event is encoded in the last 9 bits. > + * > + * This means we have 8 scheduling IDs available (which means at most 8 > + * schedulers generating events) and, in each scheduler, up to 512 > + * different events. > + */ > +#define TRC_SCHED_ID_BITS 3 > +#define TRC_SCHED_ID_SHIFT (TRC_SUBCLS_SHIFT - TRC_SCHED_ID_BITS) > +#define TRC_SCHED_ID_MASK (((1UL<<TRC_SCHED_ID_BITS) - 1) << TRC_SCHED_ID_SHIFT) > +#define TRC_SCHED_EVT_MASK (~(TRC_SCHED_ID_MASK)) > + > +/* Per-scheduler IDs, to identify scheduler specific events */ > +#define TRC_SCHED_CSCHED 0 > +#define TRC_SCHED_CSCHED2 1 > +#define TRC_SCHED_SEDF 2 > +#define TRC_SCHED_ARINC653 3 > + > +/* Per-scheduler tracing */ > +#define TRC_SCHED_CLASS_EVT(_c, _e) \ > + ( ( TRC_SCHED_CLASS | \ > + ((TRC_SCHED_##_c << TRC_SCHED_ID_SHIFT) & TRC_SCHED_ID_MASK) ) + \ > + (_e & TRC_SCHED_EVT_MASK) ) > + > /* Trace classes for Hardware */ > #define TRC_HW_PM 0x00801000 /* Power management traces */ > #define TRC_HW_IRQ 0x00802000 /* Traces relating to the handling of IRQs */
George Dunlap
2012-Dec-14 20:05 UTC
Re: [PATCH 6 of 6 v2] xen: sched_credit: add some tracing
On 12/12/12 02:52, Dario Faggioli wrote:> About tickling, and PCPU selection. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > --- > Changes from v1: > * Dummy `struct d {}'', accommodating `cpu'' only, removed > in spite of the much more readable `trace_var(..., sizeof(cpu), &cpu)'', > as suggested. > > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -21,6 +21,7 @@ > #include <asm/atomic.h> > #include <xen/errno.h> > #include <xen/keyhandler.h> > +#include <xen/trace.h> > > > /* > @@ -97,6 +98,18 @@ > > > /* > + * Credit tracing events ("only" 512 available!). Check > + * include/public/trace.h for more details. > + */ > +#define TRC_CSCHED_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED, 1) > +#define TRC_CSCHED_ACCOUNT_START TRC_SCHED_CLASS_EVT(CSCHED, 2) > +#define TRC_CSCHED_ACCOUNT_STOP TRC_SCHED_CLASS_EVT(CSCHED, 3) > +#define TRC_CSCHED_STOLEN_VCPU TRC_SCHED_CLASS_EVT(CSCHED, 4) > +#define TRC_CSCHED_PICKED_CPU TRC_SCHED_CLASS_EVT(CSCHED, 5) > +#define TRC_CSCHED_TICKLE TRC_SCHED_CLASS_EVT(CSCHED, 6) > + > + > +/* > * Boot parameters > */ > static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS; > @@ -315,9 +328,18 @@ static inline void > } > } > > - /* Send scheduler interrupts to designated CPUs */ > if ( !cpumask_empty(&mask) ) > + { > + if ( unlikely(tb_init_done) ) > + { > + /* Avoid TRACE_*: saves checking !tb_init_done each step */ > + for_each_cpu(cpu, &mask) > + trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu); > + }Hmm, probably should have pointed this out before, but trace_var() is a static inline which checks tb_init_done -- you want __trace_var(). :-) The rest still looks good. 
-George> + > + /* Send scheduler interrupts to designated CPUs */ > cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ); > + } > } > > static void > @@ -554,6 +576,8 @@ static int > if ( commit && spc ) > spc->idle_bias = cpu; > > + TRACE_3D(TRC_CSCHED_PICKED_CPU, vc->domain->domain_id, vc->vcpu_id, cpu); > + > return cpu; > } > > @@ -586,6 +610,9 @@ static inline void > } > } > > + TRACE_3D(TRC_CSCHED_ACCOUNT_START, sdom->dom->domain_id, > + svc->vcpu->vcpu_id, sdom->active_vcpu_count); > + > spin_unlock_irqrestore(&prv->lock, flags); > } > > @@ -608,6 +635,9 @@ static inline void > { > list_del_init(&sdom->active_sdom_elem); > } > + > + TRACE_3D(TRC_CSCHED_ACCOUNT_STOP, sdom->dom->domain_id, > + svc->vcpu->vcpu_id, sdom->active_vcpu_count); > } > > static void > @@ -1241,6 +1271,8 @@ csched_runq_steal(int peer_cpu, int cpu, > if (__csched_vcpu_is_migrateable(vc, cpu)) > { > /* We got a candidate. Grab it! */ > + TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, > + vc->domain->domain_id, vc->vcpu_id); > SCHED_VCPU_STAT_CRANK(speer, migrate_q); > SCHED_STAT_CRANK(migrate_queued); > WARN_ON(vc->is_urgent); > @@ -1401,6 +1433,7 @@ csched_schedule( > /* Tasklet work (which runs in idle VCPU context) overrides all else. */ > if ( tasklet_work_scheduled ) > { > + TRACE_0D(TRC_CSCHED_SCHED_TASKLET); > snext = CSCHED_VCPU(idle_vcpu[cpu]); > snext->pri = CSCHED_PRI_TS_BOOST; > }
Jan Beulich
2012-Dec-17 08:35 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
Jan Beulich
2012-Dec-17 08:35 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 14.12.12 at 20:50, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/12/12 10:04, Jan Beulich wrote:
>>>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> --- a/xen/common/sched_credit.c
>>> +++ b/xen/common/sched_credit.c
>>> @@ -59,6 +59,8 @@
>>>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>>>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>>>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
>>> +/* Is the first element of _cpu's runq its idle vcpu? */
>>> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>>>
>>>
>>>  /*
>>> @@ -479,9 +481,14 @@ static int
>>>       * distinct cores first and guarantees we don't do something stupid
>>>       * like run two VCPUs on co-hyperthreads while there are idle cores
>>>       * or sockets.
>>> +     *
>>> +     * Notice that, when computing the "idleness" of cpu, we may want to
>>> +     * discount vc. That is, iff vc is the currently running and the only
>>> +     * runnable vcpu on cpu, we add cpu to the idlers.
>>>       */
>>>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    cpumask_set_cpu(cpu, &idlers);
>>> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
>>> +        cpumask_set_cpu(cpu, &idlers);
>>>      cpumask_and(&cpus, &cpus, &idlers);
>>>      cpumask_clear_cpu(cpu, &cpus);
>>>
>>> @@ -489,7 +496,7 @@ static int
>>>      {
>>>          cpumask_t cpu_idlers;
>>>          cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;
>>>          int migrate_factor;
>>>
>>>          nxt = cpumask_cycle(cpu, &cpus);
>>> @@ -513,12 +520,12 @@ static int
>>>              cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
>>>          }
>>>
>>> -        weight_cpu = cpumask_weight(&cpu_idlers);
>>> -        weight_nxt = cpumask_weight(&nxt_idlers);
>>> +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
>>> +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
>>>          /* smt_power_savings: consolidate work rather than spreading it */
>>>          if ( sched_smt_power_savings ?
>>> -             weight_cpu > weight_nxt :
>>> -             weight_cpu * migrate_factor < weight_nxt )
>>> +             nr_idlers_cpu > nr_idlers_nxt :
>>> +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>>>          {
>>>              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>>>              spc = CSCHED_PCPU(nxt);
>>
>> Despite you mentioning this in the description, these last two hunks
>> are, afaict, only renaming variables (and that's even debatable, as
>> the current names aren't really misleading imo), and hence I don't
>> think belong in a patch that clearly has the potential for causing
>> (performance) regressions.
>>
>> That said - I don't think it will (and even more, I'm agreeable to the
>> change done).
>>
>>> --- a/xen/include/xen/sched.h
>>> +++ b/xen/include/xen/sched.h
>>> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>>>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>>>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>>>
>>> +#define current_on_cpu(_c) \
>>> +    ( (per_cpu(schedule_data, _c).curr) )
>>> +
>> This, imo, really belongs into sched-if.h.
>
> Hmm, it looks like there are a number of things that could live in
> either sched-if.h or sched.h; but I think this one probably most closely
> links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
> which are in sched.h; so sched.h is where I'd put it.

Any use of schedule_data, the type of which is declared in
sched-if.h, should be in sched-if.h - someone only including
sched.h can't make use of it anyway (and it's intended to be
used by scheduler code, i.e. shouldn't be visible to other
code).

>> Plus - what's the point of double parentheses, when in fact none
>> at all would be needed?
>>
>> And finally, why "_c" and not just "c"?
>
> I think the underscore is pretty standard in macros.

It's bad practice imo; I have always understood this as questionable
attempts of people to avoid name clashes (which is understandable
only for variables declared locally inside a macro definition).

Jan
Dario Faggioli
2012-Dec-17 14:36 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Mon, 2012-12-17 at 08:35 +0000, Jan Beulich wrote:
> >>> On 14.12.12 at 20:50, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > On 12/12/12 10:04, Jan Beulich wrote:
> >> This, imo, really belongs into sched-if.h.
> >
> > Hmm, it looks like there are a number of things that could live in
> > either sched-if.h or sched.h; but I think this one probably most closely
> > links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
> > which are in sched.h; so sched.h is where I'd put it.
>
> Any use of schedule_data, the type of which is declared in
> sched-if.h, should be in sched-if.h - someone only including
> sched.h can't make use of it anyway (and it's intended to be
> used by scheduler code, i.e. shouldn't be visible to other
> code).
>
Ok, I find this argument quite convincing; I'll put the macro in
sched-if.h.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2012-Dec-17 14:41 UTC
Re: [PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
On Fri, 2012-12-14 at 19:39 +0000, George Dunlap wrote:
> Hmm, I hadn't read this patch when I commented about removing the macro
> from the first patch. :-)
>
:-)

> I personally think that using vc->processor is better in that patch
> anyway; but using this macro elsewhere is probably fine.
>
Ok.

> I think from a taste point of view, I would have put this patch, with
> the new definition, as the first patch in the series, and then had the
> second patch just use it.
>
Ok then, when respinning I'll have the first patch defining and using the
macro. Then I'll have the 'fix picking' patch using vc->processor
_instead of_ the macro. Yes, that kills the need for the macro, but since
I already have the code for it, and I think things do look a bit better
with it, I won't actually kill it. Let me know (here or replying to the
corresponding e-mail in the reposting) if you don't want it.

Thanks and Regards,
Dario
Dario Faggioli
2012-Dec-17 14:43 UTC
Re: [PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
On Fri, 2012-12-14 at 19:57 +0000, George Dunlap wrote:
> On 12/12/12 02:52, Dario Faggioli wrote:
> > When looking at traces, it turns out to be useful to know where a
> > waking-up VCPU is being queued. Yes, that is always the CPU where
> > it ran last, but that information can well be lost in past trace
> > records!
>
> When you say "lost in past trace records", do you primarily mean that
> the records themselves have been lost (due to the per-cpu trace buffers
> filling up), or do you mean that it may be way way back and you don't
> want to go back and find it?
>
The latter... I'm quite lazy when looking at traces! :-P

> If the latter, I think the best thing to do would be to just augment
> xenalyze to keep track of that information and print it when it sees the
> wake record.
>
I agree. I'll kill this patch for now, and investigate further solutions
along the lines you suggested in the future.

Thanks and Regards,
Dario
Dario Faggioli
2012-Dec-17 14:45 UTC
Re: [PATCH 6 of 6 v2] xen: sched_credit: add some tracing
On Fri, 2012-12-14 at 20:05 +0000, George Dunlap wrote:
> > -    /* Send scheduler interrupts to designated CPUs */
> >      if ( !cpumask_empty(&mask) )
> > +    {
> > +        if ( unlikely(tb_init_done) )
> > +        {
> > +            /* Avoid TRACE_*: saves checking !tb_init_done each step */
> > +            for_each_cpu(cpu, &mask)
> > +                trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu);
> > +        }
>
> Hmm, probably should have pointed this out before, but trace_var() is a
> static inline which checks tb_init_done -- you want __trace_var(). :-)
>
Correct, my bad. It actually was __trace_var() at the beginning (that's
the reason why this particular record deserves special treatment), but I
messed things up while preparing this new version. Sorry for that, and
thanks for having spotted this! :-)

Will fix.

Regards,
Dario