Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 0 of 6 v2] xen: sched_credit: fix picking & tickling and also add some tracing
Hello everyone,

This is v2 of my previously submitted series about fixing scheduling anomalies and introducing some tracing in the credit scheduler (with a couple of other side effects). All the comments v1 got have been addressed, and the series grew a couple more patches as I found some other issues, still falling under the broad description given above. Details are given in the individual changelogs but, to make review as easy as possible, here comes a short overview.

[1 of 6] xen: sched_credit: improve picking up the idlal CPU for a VCPU
[2 of 6] xen: sched_credit: improve tickling of idle CPUs

Are the fixes to the scheduling anomalies, happening during PCPU picking and tickling, respectively. The latter has already been extensively discussed (by me and George, mainly); the former is a new --small but nasty-- thing I discovered during a couple of heavy tracing sessions. :-)

All the benchmarks have been rerun. No big changes in trends or anything: what held true for v1 still does here (although, honestly, the numbers look even a little bit better).

[3 of 6] xen: sched_credit: use current_on_cpu() when appropriate

Is just (an attempt) to improve code readability.

[4 of 6] xen: tracing: report where a VCPU wakes up

Is just (an attempt) to improve trace readability.

[5 of 6] xen: tracing: introduce per-scheduler trace event IDs
[6 of 6] xen: sched_credit: add some tracing

Finally, are what enables the per-scheduler trace record generation already discussed (again, mostly by me and George), reworked as suggested and requested during review of v1 of this series.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
In _csched_cpu_pick() we try to select the best possible CPU for running a VCPU, considering the characteristics of the underlying hardware (i.e., how many threads, cores and sockets there are, and how busy they are). What we want is "the idle execution vehicle with the most idling neighbours in its grouping". In order to achieve that, we select a CPU from the VCPU's affinity, giving preference to its current processor if possible, as the basis for the comparison with all the other CPUs.

Problem is, to discount the VCPU itself when computing this "idleness" (in an attempt to be fair wrt its current processor), we arbitrarily and unconditionally consider that selected CPU as idle, even when that is not the case. For instance:

 1. the CPU is not the one where the VCPU is running (perhaps due to the affinity being changed);
 2. the CPU is where the VCPU is running, but it has other VCPUs in its runq, so it won't go idle even if the VCPU in question does.

This is exemplified in the trace below:

]  3.466115364 x|------|------| d10v1   22005(2:2:5) 3 [ a  1  8 ]
   ...    ...    ...
   3.466122856 x|------|------| d10v1 runstate_change d10v1 running->offline
   3.466123046 x|------|------| d?v?  runstate_change d32767v0 runnable->running
   ...    ...    ...
]  3.466126887 x|------|------| d32767v0  28004(2:8:4) 3 [ a  1  8 ]

The 22005(...) line (the first line) means _csched_cpu_pick() was called on VCPU 1 of domain 10, while it was running on CPU 0, and it chose CPU 8, which is busy ('|'), even though there are plenty of idle CPUs. That is because, as a consequence of changing the VCPU affinity, CPU 8 was chosen as the basis for the comparison, and therefore considered idle (its bit gets unconditionally set in the bitmask representing the idle CPUs). The 28004(...) line means the VCPU was woken up and queued on CPU 8's runq, where it waits for a context switch or a migration in order to be able to execute.
This change fixes things by only considering the "guessed" CPU idle if the VCPU in question is both running there and is its only runnable VCPU. While at it, rename the two variables (within _csched_cpu_pick()) counting the number of idlers for `cpu' and `nxt' to `nr_idlers_cpu' and `nr_idlers_nxt', which makes their job a little more evident than the current `weight_*' names do.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -59,6 +59,8 @@
 #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
 #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
 #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
+/* Is the first element of _cpu's runq its idle vcpu? */
+#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))


 /*
@@ -479,9 +481,14 @@ static int
      * distinct cores first and guarantees we don't do something stupid
      * like run two VCPUs on co-hyperthreads while there are idle cores
      * or sockets.
+     *
+     * Notice that, when computing the "idleness" of cpu, we may want to
+     * discount vc. That is, iff vc is the currently running and the only
+     * runnable vcpu on cpu, we add cpu to the idlers.
     */
    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    cpumask_set_cpu(cpu, &idlers);
+    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
+        cpumask_set_cpu(cpu, &idlers);
    cpumask_and(&cpus, &cpus, &idlers);
    cpumask_clear_cpu(cpu, &cpus);

@@ -489,7 +496,7 @@ static int
    {
        cpumask_t cpu_idlers;
        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
+        int nxt, nr_idlers_cpu, nr_idlers_nxt;
        int migrate_factor;

        nxt = cpumask_cycle(cpu, &cpus);

@@ -513,12 +520,12 @@ static int
            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
        }

-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
+        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
+        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
        /* smt_power_savings: consolidate work rather than spreading it */
        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
+             nr_idlers_cpu > nr_idlers_nxt :
+             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
        {
            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
            spc = CSCHED_PCPU(nxt);

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
 #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
 #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))

+#define current_on_cpu(_c) \
+    ( (per_cpu(schedule_data, _c).curr) )
+
 #define DOMAIN_DESTROYED (1<<31) /* assumes atomic_t is >= 32 bits */
 #define put_domain(_d) \
   if ( atomic_dec_and_test(&(_d)->refcnt) ) domain_destroy(_d)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs
Right now, when a VCPU wakes up, we check whether it should preempt what is running on the PCPU, and whether or not the waking VCPU can be migrated (by tickling some idlers). However, this can result in suboptimal or even wrong behaviour, as explained here:

http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

This change, instead, when deciding which PCPU(s) to tickle upon VCPU wake-up, considers both what is likely to happen on the PCPU where the wakeup occurs, and whether or not there are idlers where the woken-up VCPU can run. In fact, if there are, we can avoid interrupting the running VCPU. Only in case there aren't any such PCPUs are preemption and migration the way to go.

This has been tested (on top of the previous change) by running the following benchmarks inside 2, 6 and 10 VMs, concurrently, on a shared host, each with 2 VCPUs and 960 MB of memory (the host had 16 ways and 12 GB RAM).

1) All VMs had 'cpus="all"' in their config file.

 $ sysbench --test=cpu ... (time, lower is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 50.078467 +/- 1.6676162  | 49.673667 +/- 0.0094321  |
 |   6 | 63.259472 +/- 0.1137586  | 61.680011 +/- 1.0208723  |
 |  10 | 91.246797 +/- 0.1154008  | 90.396720 +/- 1.5900423  |

 $ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 485.56333 +/- 6.0527356  | 487.83167 +/- 0.7602850  |
 |   6 | 401.36278 +/- 1.9745916  | 409.96778 +/- 3.6761092  |
 |  10 | 294.43933 +/- 0.8064945  | 302.49033 +/- 0.2343978  |

 $ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 43150.63 +/- 1359.5616   | 43275.427 +/- 606.28185  |
 |   6 | 29274.29 +/- 1024.4042   | 29716.189 +/- 1290.1878  |
 |  10 | 19061.28 +/- 512.88561   | 19192.599 +/- 605.66058  |

2) All VMs had their VCPUs statically pinned to the host's PCPUs.

 $ sysbench --test=cpu ...
 (time, lower is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 47.8211 +/- 0.0215504    | 47.826900 +/- 0.0077872  |
 |   6 | 62.689122 +/- 0.0877173  | 62.764539 +/- 0.3882493  |
 |  10 | 90.321097 +/- 1.4803867  | 89.974570 +/- 1.1437566  |

 $ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 550.97667 +/- 2.3512355  | 550.87000 +/- 0.8140792  |
 |   6 | 443.15000 +/- 5.7471797  | 454.01056 +/- 8.4373466  |
 |  10 | 313.89233 +/- 1.3237493  | 321.81167 +/- 0.3528418  |

 $ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change          | w/ this change           |
 |   2 | 49591.057 +/- 952.93384  | 49594.195 +/- 799.57976  |
 |   6 | 33538.247 +/- 1089.2115  | 33671.758 +/- 1077.6806  |
 |  10 | 21927.870 +/- 831.88742  | 21891.131 +/- 563.37929  |

The numbers show that the change has either no or very limited impact (the specjbb2005 case) or, when it does have some impact, a real improvement in performance (the sysbench-memory case).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * Rewritten as per George's suggestion, in order to improve readability.
 * Killed some of the stats, with the only exception of `tickle_idlers_none' and `tickle_idlers_some'. They don't make things look that terrible, and I think they could be useful.
 * The preemption+migration of the currently running VCPU has been turned into a migration request, instead of just tickling. I traced this thing some more, and it looks like that is the way to go. Tickling is not effective here, because the woken PCPU would expect cur to be out of the scheduler tail, which is likely false (cur->is_running is still set to 1).
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -133,6 +133,7 @@ struct csched_vcpu {
         uint32_t state_idle;
         uint32_t migrate_q;
         uint32_t migrate_r;
+        uint32_t kicked_away;
     } stats;
 #endif
 };
@@ -251,54 +252,67 @@ static inline void
    struct csched_vcpu * const cur =
        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
-    cpumask_t mask;
+    cpumask_t mask, idle_mask;
+    int idlers_empty;

    ASSERT(cur);
    cpumask_clear(&mask);

-    /* If strictly higher priority than current VCPU, signal the CPU */
-    if ( new->pri > cur->pri )
+    idlers_empty = cpumask_empty(prv->idlers);
+
+    /*
+     * If the pcpu is idle, or there are no idlers and the new
+     * vcpu is a higher priority than the old vcpu, run it here.
+     *
+     * If there are idle cpus, first try to find one suitable to run
+     * new, so we can avoid preempting cur. If we cannot find a
+     * suitable idler on which to run new, run it here, but try to
+     * find a suitable idler on which to run cur instead.
+     */
+    if ( cur->pri == CSCHED_PRI_IDLE
+         || (idlers_empty && new->pri > cur->pri) )
    {
-        if ( cur->pri == CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_local_idler);
-        else if ( cur->pri == CSCHED_PRI_TS_OVER )
-            SCHED_STAT_CRANK(tickle_local_over);
-        else if ( cur->pri == CSCHED_PRI_TS_UNDER )
-            SCHED_STAT_CRANK(tickle_local_under);
-        else
-            SCHED_STAT_CRANK(tickle_local_other);
-
+        if ( cur->pri != CSCHED_PRI_IDLE )
+            SCHED_STAT_CRANK(tickle_idlers_none);
        cpumask_set_cpu(cpu, &mask);
    }
+    else if ( !idlers_empty )
+    {
+        /* Check whether or not there are idlers that can run new */
+        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);

-    /*
-     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
-     * let them know there is runnable work in the system...
-     */
-    if ( cur->pri > CSCHED_PRI_IDLE )
-    {
-        if ( cpumask_empty(prv->idlers) )
+        /*
+         * If there are no suitable idlers for new, and it's higher
+         * priority than cur, ask the scheduler to migrate cur away.
+         * We have to act like this (instead of just waking some of
+         * the idlers suitable for cur) because cur is running.
+         *
+         * If there are suitable idlers for new, no matter priorities,
+         * leave cur alone (as it is running and is, likely, cache-hot)
+         * and wake some of them (which is waking up and so is, likely,
+         * cache cold anyway).
+         */
+        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
        {
            SCHED_STAT_CRANK(tickle_idlers_none);
+            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+            SCHED_STAT_CRANK(migrate_kicked_away);
+            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            cpumask_set_cpu(cpu, &mask);
        }
-        else
+        else if ( !cpumask_empty(&idle_mask) )
        {
-            cpumask_t idle_mask;
-
-            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            /* Which of the idlers suitable for new shall we wake up? */
+            SCHED_STAT_CRANK(tickle_idlers_some);
+            if ( opt_tickle_one_idle )
            {
-                SCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) =
-                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                this_cpu(last_tickle_cpu) =
+                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
            }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
+            else
+                cpumask_or(&mask, &mask, &idle_mask);
        }
    }
@@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc
    {
        printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
 #ifdef CSCHED_STATS
-        printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
+        printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}",
               svc->stats.credit_last,
               svc->stats.credit_incr,
               svc->stats.state_active,
               svc->stats.state_idle,
               svc->stats.migrate_q,
-              svc->stats.migrate_r);
+              svc->stats.migrate_r,
+              svc->stats.kicked_away);
 #endif
    }

diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -39,10 +39,6 @@
 PERFCOUNTER(vcpu_wake_runnable,     "csched: vcpu_wake_runnable")
 PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable")
 PERFCOUNTER(vcpu_park,              "csched: vcpu_park")
 PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
-PERFCOUNTER(tickle_local_idler,     "csched: tickle_local_idler")
-PERFCOUNTER(tickle_local_over,      "csched: tickle_local_over")
-PERFCOUNTER(tickle_local_under,     "csched: tickle_local_under")
-PERFCOUNTER(tickle_local_other,     "csched: tickle_local_other")
 PERFCOUNTER(tickle_idlers_none,     "csched: tickle_idlers_none")
 PERFCOUNTER(tickle_idlers_some,     "csched: tickle_idlers_some")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
@@ -52,6 +48,7 @@
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")
 PERFCOUNTER(migrate_running,        "csched: migrate_running")
+PERFCOUNTER(migrate_kicked_away,    "csched: migrate_kicked_away")
 PERFCOUNTER(vcpu_hot,               "csched: vcpu_hot")

 PERFCOUNTER(need_flush_tlb_flush,   "PG_need_flush tlb flushes")
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
Defined by the previous change, using it wherever appropriate throughout sched_credit.c makes the code easier to read.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -231,7 +231,7 @@ static void burn_credits(struct csched_v
    unsigned int credits;

    /* Assert svc is current */
-    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+    ASSERT( svc == CSCHED_VCPU(current_on_cpu(svc->vcpu->processor)) );

    if ( (delta = now - svc->start_time) <= 0 )
        return;
@@ -249,8 +249,7 @@ DEFINE_PER_CPU(unsigned int, last_tickle
 static inline void
 __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 {
-    struct csched_vcpu * const cur =
-        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    struct csched_vcpu * const cur = CSCHED_VCPU(current_on_cpu(cpu));
    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
    cpumask_t mask, idle_mask;
    int idlers_empty;
@@ -387,7 +386,7 @@ csched_alloc_pdata(const struct schedule
    per_cpu(schedule_data, cpu).sched_priv = spc;

    /* Start off idling... */
-    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+    BUG_ON(!is_idle_vcpu(current_on_cpu(cpu)));
    cpumask_set_cpu(cpu, prv->idlers);

    spin_unlock_irqrestore(&prv->lock, flags);
@@ -730,7 +729,7 @@ csched_vcpu_sleep(const struct scheduler

    BUG_ON( is_idle_vcpu(vc) );

-    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+    if ( current_on_cpu(vc->processor) == vc )
        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);
@@ -744,7 +743,7 @@ csched_vcpu_wake(const struct scheduler

    BUG_ON( is_idle_vcpu(vc) );

-    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    if ( unlikely(current_on_cpu(cpu) == vc) )
    {
        SCHED_STAT_CRANK(vcpu_wake_running);
        return;
@@ -1213,7 +1212,7 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri)
 {
    const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = per_cpu(schedule_data, peer_cpu).curr;
+    const struct vcpu * const peer_vcpu = current_on_cpu(peer_cpu);
    struct csched_vcpu *speer;
    struct list_head *iter;
    struct vcpu *vc;
@@ -1502,7 +1501,7 @@ csched_dump_pcpu(const struct scheduler
    printk("core=%s\n", cpustr);

    /* current VCPU */
-    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    svc = CSCHED_VCPU(current_on_cpu(cpu));
    if ( svc )
    {
        printk("\trun: ");
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
When looking at traces, it turns out to be useful to know where a waking-up VCPU is being queued. Yes, that is always the CPU where it last ran, but that information can well be lost in past trace records!

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -371,7 +371,7 @@ void vcpu_wake(struct vcpu *v)

    vcpu_schedule_unlock_irqrestore(v, flags);

-    TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);
+    TRACE_3D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id, v->processor);
 }

 void vcpu_unblock(struct vcpu *v)
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 5 of 6 v2] xen: tracing: introduce per-scheduler trace event IDs
So that it becomes possible to create scheduler-specific trace records, within each scheduler, without worrying about overlaps, and also without giving up the ability to recognise them unambiguously. The latter is deemed useful, since we can have more than one scheduler running at the same time, thanks to cpupools.

The event ID is 12 bits, and this change uses the upper 3 of them for the 'scheduler ID'. This means we are limited to 8 schedulers and to 512 scheduler-specific tracing events. Both seem reasonable limitations as of now.

This also converts the existing credit2 tracing (the only scheduler generating tracing events up to now) to the new system.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * The event ID generation macro is now called `TRC_SCHED_CLASS_EVT()', and has been generalized and put in trace.h, as suggested.
 * The handling of per-scheduler tracing IDs and masks has been restructured, properly naming "ID" the numerical identifiers and "MASK" the bitmasks, as requested.

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -29,18 +29,22 @@
 #define d2printk(x...)
 //#define d2printk printk

-#define TRC_CSCHED2_TICK             TRC_SCHED_CLASS + 1
-#define TRC_CSCHED2_RUNQ_POS         TRC_SCHED_CLASS + 2
-#define TRC_CSCHED2_CREDIT_BURN      TRC_SCHED_CLASS + 3
-#define TRC_CSCHED2_CREDIT_ADD       TRC_SCHED_CLASS + 4
-#define TRC_CSCHED2_TICKLE_CHECK     TRC_SCHED_CLASS + 5
-#define TRC_CSCHED2_TICKLE           TRC_SCHED_CLASS + 6
-#define TRC_CSCHED2_CREDIT_RESET     TRC_SCHED_CLASS + 7
-#define TRC_CSCHED2_SCHED_TASKLET    TRC_SCHED_CLASS + 8
-#define TRC_CSCHED2_UPDATE_LOAD      TRC_SCHED_CLASS + 9
-#define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS + 10
-#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS + 11
-#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS + 12
+/*
+ * Credit2 tracing events ("only" 512 available!). Check
+ * include/public/trace.h for more details.
+ */
+#define TRC_CSCHED2_TICK             TRC_SCHED_CLASS_EVT(CSCHED2, 1)
+#define TRC_CSCHED2_RUNQ_POS         TRC_SCHED_CLASS_EVT(CSCHED2, 2)
+#define TRC_CSCHED2_CREDIT_BURN      TRC_SCHED_CLASS_EVT(CSCHED2, 3)
+#define TRC_CSCHED2_CREDIT_ADD       TRC_SCHED_CLASS_EVT(CSCHED2, 4)
+#define TRC_CSCHED2_TICKLE_CHECK     TRC_SCHED_CLASS_EVT(CSCHED2, 5)
+#define TRC_CSCHED2_TICKLE           TRC_SCHED_CLASS_EVT(CSCHED2, 6)
+#define TRC_CSCHED2_CREDIT_RESET     TRC_SCHED_CLASS_EVT(CSCHED2, 7)
+#define TRC_CSCHED2_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(CSCHED2, 8)
+#define TRC_CSCHED2_UPDATE_LOAD      TRC_SCHED_CLASS_EVT(CSCHED2, 9)
+#define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS_EVT(CSCHED2, 10)
+#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11)
+#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12)

 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the

diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -57,6 +57,32 @@
 #define TRC_SCHED_CLASS   0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE 0x00028000   /* More inclusive scheduling */

+/*
+ * The highest 3 bits of the last 12 bits of TRC_SCHED_CLASS above are
+ * reserved for encoding what scheduler produced the information. The
+ * actual event is encoded in the last 9 bits.
+ *
+ * This means we have 8 scheduling IDs available (which means at most 8
+ * schedulers generating events) and, in each scheduler, up to 512
+ * different events.
+ */
+#define TRC_SCHED_ID_BITS  3
+#define TRC_SCHED_ID_SHIFT (TRC_SUBCLS_SHIFT - TRC_SCHED_ID_BITS)
+#define TRC_SCHED_ID_MASK  (((1UL<<TRC_SCHED_ID_BITS) - 1) << TRC_SCHED_ID_SHIFT)
+#define TRC_SCHED_EVT_MASK (~(TRC_SCHED_ID_MASK))
+
+/* Per-scheduler IDs, to identify scheduler specific events */
+#define TRC_SCHED_CSCHED   0
+#define TRC_SCHED_CSCHED2  1
+#define TRC_SCHED_SEDF     2
+#define TRC_SCHED_ARINC653 3
+
+/* Per-scheduler tracing */
+#define TRC_SCHED_CLASS_EVT(_c, _e) \
+    ( ( TRC_SCHED_CLASS | \
+        ((TRC_SCHED_##_c << TRC_SCHED_ID_SHIFT) & TRC_SCHED_ID_MASK) ) + \
+      (_e & TRC_SCHED_EVT_MASK) )
+
 /* Trace classes for Hardware */
 #define TRC_HW_PM           0x00801000   /* Power management traces */
 #define TRC_HW_IRQ          0x00802000   /* Traces relating to the handling of IRQs */
Dario Faggioli
2012-Dec-12 02:52 UTC
[PATCH 6 of 6 v2] xen: sched_credit: add some tracing
About tickling, and PCPU selection.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * Dummy `struct d {}', accommodating `cpu' only, removed in favour of the much more readable `trace_var(..., sizeof(cpu), &cpu)', as suggested.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -21,6 +21,7 @@
 #include <asm/atomic.h>
 #include <xen/errno.h>
 #include <xen/keyhandler.h>
+#include <xen/trace.h>

 /*
@@ -97,6 +98,18 @@

 /*
+ * Credit tracing events ("only" 512 available!). Check
+ * include/public/trace.h for more details.
+ */
+#define TRC_CSCHED_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED, 1)
+#define TRC_CSCHED_ACCOUNT_START TRC_SCHED_CLASS_EVT(CSCHED, 2)
+#define TRC_CSCHED_ACCOUNT_STOP  TRC_SCHED_CLASS_EVT(CSCHED, 3)
+#define TRC_CSCHED_STOLEN_VCPU   TRC_SCHED_CLASS_EVT(CSCHED, 4)
+#define TRC_CSCHED_PICKED_CPU    TRC_SCHED_CLASS_EVT(CSCHED, 5)
+#define TRC_CSCHED_TICKLE        TRC_SCHED_CLASS_EVT(CSCHED, 6)
+
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -315,9 +328,18 @@ static inline void
        }
    }

-    /* Send scheduler interrupts to designated CPUs */
    if ( !cpumask_empty(&mask) )
+    {
+        if ( unlikely(tb_init_done) )
+        {
+            /* Avoid TRACE_*: saves checking !tb_init_done each step */
+            for_each_cpu(cpu, &mask)
+                trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu);
+        }
+
+        /* Send scheduler interrupts to designated CPUs */
        cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
+    }
 }

 static void
@@ -554,6 +576,8 @@ static int
    if ( commit && spc )
        spc->idle_bias = cpu;

+    TRACE_3D(TRC_CSCHED_PICKED_CPU, vc->domain->domain_id, vc->vcpu_id, cpu);
+
    return cpu;
 }
@@ -586,6 +610,9 @@ static inline void
        }
    }

+    TRACE_3D(TRC_CSCHED_ACCOUNT_START, sdom->dom->domain_id,
+             svc->vcpu->vcpu_id, sdom->active_vcpu_count);
+
    spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -608,6 +635,9 @@ static inline void
    {
        list_del_init(&sdom->active_sdom_elem);
    }
+
+    TRACE_3D(TRC_CSCHED_ACCOUNT_STOP, sdom->dom->domain_id,
+             svc->vcpu->vcpu_id, sdom->active_vcpu_count);
 }

 static void
@@ -1241,6 +1271,8 @@ csched_runq_steal(int peer_cpu, int cpu,
        if (__csched_vcpu_is_migrateable(vc, cpu))
        {
            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
            SCHED_STAT_CRANK(migrate_queued);
            WARN_ON(vc->is_urgent);
@@ -1401,6 +1433,7 @@ csched_schedule(
    /* Tasklet work (which runs in idle VCPU context) overrides all else. */
    if ( tasklet_work_scheduled )
    {
+        TRACE_0D(TRC_CSCHED_SCHED_TASKLET);
        snext = CSCHED_VCPU(idle_vcpu[cpu]);
        snext->pri = CSCHED_PRI_TS_BOOST;
    }
Jan Beulich
2012-Dec-12 10:04 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -59,6 +59,8 @@
>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
> +/* Is the first element of _cpu's runq its idle vcpu? */
> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>
>  /*
> @@ -479,9 +481,14 @@ static int
>       * distinct cores first and guarantees we don't do something stupid
>       * like run two VCPUs on co-hyperthreads while there are idle cores
>       * or sockets.
> +     *
> +     * Notice that, when computing the "idleness" of cpu, we may want to
> +     * discount vc. That is, iff vc is the currently running and the only
> +     * runnable vcpu on cpu, we add cpu to the idlers.
>       */
>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    cpumask_set_cpu(cpu, &idlers);
> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
> +        cpumask_set_cpu(cpu, &idlers);
>      cpumask_and(&cpus, &cpus, &idlers);
>      cpumask_clear_cpu(cpu, &cpus);
>
> @@ -489,7 +496,7 @@ static int
>      {
>          cpumask_t cpu_idlers;
>          cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;
>          int migrate_factor;
>
>          nxt = cpumask_cycle(cpu, &cpus);
>
> @@ -513,12 +520,12 @@ static int
>              cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
>          }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
> +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
>          /* smt_power_savings: consolidate work rather than spreading it */
>          if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> +             nr_idlers_cpu > nr_idlers_nxt :
> +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>          {
>              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>              spc = CSCHED_PCPU(nxt);

Despite you mentioning this in the description, these last two hunks are, afaict, only renaming variables (and even that is debatable, as the current names aren't really misleading imo), and hence I don't think they belong in a patch that clearly has the potential for causing (performance) regressions. That said - I don't think it will (and, even more, I'm agreeable to the change done).

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>
> +#define current_on_cpu(_c) \
> +    ( (per_cpu(schedule_data, _c).curr) )
> +

This, imo, really belongs in sched-if.h. Plus - what's the point of the double parentheses, when in fact none at all would be needed? And finally, why "_c" and not just "c"?

Jan

> #define DOMAIN_DESTROYED (1<<31) /* assumes atomic_t is >= 32 bits */
> #define put_domain(_d) \
>   if ( atomic_dec_and_test(&(_d)->refcnt) ) domain_destroy(_d)
Dario Faggioli
2012-Dec-12 10:19 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Wed, 2012-12-12 at 10:04 +0000, Jan Beulich wrote:
> > -        weight_cpu = cpumask_weight(&cpu_idlers);
> > -        weight_nxt = cpumask_weight(&nxt_idlers);
> > +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
> > +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
> >          /* smt_power_savings: consolidate work rather than spreading it */
> >          if ( sched_smt_power_savings ?
> > -             weight_cpu > weight_nxt :
> > -             weight_cpu * migrate_factor < weight_nxt )
> > +             nr_idlers_cpu > nr_idlers_nxt :
> > +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
> >          {
> >              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> >              spc = CSCHED_PCPU(nxt);
>
> Despite you mentioning this in the description, these last two hunks
> are, afaict, only renaming variables (and that's even debatable, as
> the current names aren't really misleading imo), and hence I don't
> think belong in a patch that clearly has the potential for causing
> (performance) regressions.
>
Ok, I think I can live with the current names too... Just a matter of taste. :-)

> That said - I don't think it will (and even more, I'm agreeable to the
> change done).
>
It has been benchmarked, together with the next change, and the results are in the changelog of 2/6. The numbers there show that the combination of those two changes is much more of an improvement than anything else, at least for the workloads I considered (which include sysbench and specjbb2005).

Anyway, I think I see your point, and I can either move the rename somewhere else or kill it entirely.

> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
> >  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
> >  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
> >
> > +#define current_on_cpu(_c) \
> > +    ( (per_cpu(schedule_data, _c).curr) )
> > +
>
> This, imo, really belongs in sched-if.h.
>
Ok.

> Plus - what's the point of double parentheses, when in fact none
> at all would be needed?
>
> And finally, why "_c" and not just "c"?
>
Nothing particular, just "personal macro style", I guess, which I can convert to what you ask and resend.

Thanks,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2012-Dec-12 10:30 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 12.12.12 at 11:19, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> On Wed, 2012-12-12 at 10:04 +0000, Jan Beulich wrote:
>> Despite you mentioning this in the description, these last two hunks
>> are, afaict, only renaming variables (and that's even debatable, as
>> the current names aren't really misleading imo), and hence I don't
>> think belong in a patch that clearly has the potential for causing
>> (performance) regressions.
>>
> Ok, I think I can live with the current names too... Just a matter of
> taste. :-)
>
>> That said - I don't think it will (and even more, I'm agreeable to the
>> change done).
>>
> It has been benchmarked, together with the next change, and the results
> are in the changelog of 2/6. Numbers there show that the combination of
> those two changes is much more an improvement than anything else, at
> least for the workloads I considered (which include sysbench and
> specjbb2005).
>
> Anyway, I think I see your point, and I can either move the rename
> somewhere else or kill it entirely.

Yes please; I'll leave it to George to decide upon an eventual
separate renaming patch.

Btw., when you resend, can you please also fix the subject, so
grepping the changeset titles for "idle" would actually hit on this
change?

Thanks, Jan
Dario Faggioli
2012-Dec-12 10:38 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Wed, 2012-12-12 at 10:30 +0000, Jan Beulich wrote:
> > Anyway, I think I see your point, and I can either move the rename
> > somewhere else or kill it entirely.
>
> Yes please; I'll leave it to George to decide upon an eventual
> separate renaming patch.
>
Ok.

> Btw., when you resend, can you please also fix the subject, so
> grepping the changeset titles for "idle" would actually hit on this
> change?
>
Oops! My bad, sorry for that. I sure will.

Dario
George Dunlap
2012-Dec-14 19:16 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On 12/12/12 02:52, Dario Faggioli wrote:
> In _csched_cpu_pick() we try to select the best possible CPU for
> running a VCPU, considering the characteristics of the underlying
> hardware (i.e., how many threads, cores, sockets, and how busy they
> are). What we want is "the idle execution vehicle with the most
> idling neighbours in its grouping".
>
> In order to achieve it, we select a CPU from the VCPU's affinity,
> giving preference to its current processor if possible, as the basis
> for the comparison with all the other CPUs. Problem is, to discount
> the VCPU itself when computing this "idleness" (in an attempt to be
> fair wrt its current processor), we arbitrarily and unconditionally
> consider that selected CPU as idle, even when it is not the case,
> for instance:
>  1. If the CPU is not the one where the VCPU is running (perhaps due
>     to the affinity being changed);
>  2. The CPU is where the VCPU is running, but it has other VCPUs in
>     its runq, so it won't go idle even if the VCPU in question goes.

Good catch -- thanks.  Comments below.

> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -59,6 +59,8 @@
>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
> +/* Is the first element of _cpu's runq its idle vcpu? */
> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>
>
>  /*
> @@ -479,9 +481,14 @@ static int
>       * distinct cores first and guarantees we don't do something stupid
>       * like run two VCPUs on co-hyperthreads while there are idle cores
>       * or sockets.
> +     *
> +     * Notice that, when computing the "idleness" of cpu, we may want to
> +     * discount vc. That is, iff vc is the currently running and the only
> +     * runnable vcpu on cpu, we add cpu to the idlers.
>      */
>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    cpumask_set_cpu(cpu, &idlers);
> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
> +        cpumask_set_cpu(cpu, &idlers);

Why bother with this whole "current_on_cpu()" thing, when you can just
look at vc->processor?  I.e.:

if ( cpu == vc->processor && IS_RUNQ_IDLE(cpu) )

>      cpumask_and(&cpus, &cpus, &idlers);
>      cpumask_clear_cpu(cpu, &cpus);
>
> @@ -489,7 +496,7 @@ static int
>      {
>          cpumask_t cpu_idlers;
>          cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;

I think Jan is right, this probably should be a separate patch.

 -George
George Dunlap
2012-Dec-14 19:29 UTC
Re: [PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs
On 12/12/12 02:52, Dario Faggioli wrote:
> Right now, when a VCPU wakes-up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPU(s) to tickle, upon
> VCPU wake-up, considers both what is likely to happen on the PCPU
> where the wakeup occurs, and whether or not there are idlers where
> the woken-up VCPU can run. In fact, if there are, we can avoid
> interrupting the running VCPU. Only in case there aren't any such
> PCPUs are preemption and migration the way to go.
>
> This has been tested (on top of the previous change) by running
> the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
> a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
> ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
> |   6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
> |  10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
> |   6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
> |  10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
> | VMs | w/o this change        | w/ this change          |
> |   2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 |
> |   6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 |
> |  10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 |
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
> |   6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
> |  10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change         | w/ this change          |
> |   2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
> |   6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
> |  10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
> |   2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
> |   6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
> |  10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |
>
> Numbers show that the change has either no or very limited impact
> (specjbb2005 case) or, when it does have some impact, that it is a
> real improvement in performance (sysbench-memory case).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
>  * Rewritten as per George's suggestion, in order to improve readability.
>  * Killed some of the stats, with the only exception of `tickle_idlers_none'
>    and `tickle_idlers_some'. They don't make things look that terrible and
>    I think they could be useful.
>  * The preemption+migration of the currently running VCPU has been turned
>    into a migration request, instead of just tickling. I traced this
>    thing some more, and it looks like that is the way to go. Tickling is
>    not effective here, because the woken PCPU would expect cur to be out
>    of the scheduler tail, which is likely false (cur->is_running is
>    still set to 1).

Ah, right.
:-) Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -133,6 +133,7 @@ struct csched_vcpu { > uint32_t state_idle; > uint32_t migrate_q; > uint32_t migrate_r; > + uint32_t kicked_away; > } stats; > #endif > }; > @@ -251,54 +252,67 @@ static inline void > struct csched_vcpu * const cur > CSCHED_VCPU(per_cpu(schedule_data, cpu).curr); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > - cpumask_t mask; > + cpumask_t mask, idle_mask; > + int idlers_empty; > > ASSERT(cur); > cpumask_clear(&mask); > > - /* If strictly higher priority than current VCPU, signal the CPU */ > - if ( new->pri > cur->pri ) > + idlers_empty = cpumask_empty(prv->idlers); > + > + /* > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. > + * > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. 
> + */ > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty && new->pri > cur->pri) ) > { > - if ( cur->pri == CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_local_idler); > - else if ( cur->pri == CSCHED_PRI_TS_OVER ) > - SCHED_STAT_CRANK(tickle_local_over); > - else if ( cur->pri == CSCHED_PRI_TS_UNDER ) > - SCHED_STAT_CRANK(tickle_local_under); > - else > - SCHED_STAT_CRANK(tickle_local_other); > - > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu, &mask); > } > + else if ( !idlers_empty ) > + { > + /* Check whether or not there are idlers that can run new */ > + cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > > - /* > - * If this CPU has at least two runnable VCPUs, we tickle any idlers to > - * let them know there is runnable work in the system... > - */ > - if ( cur->pri > CSCHED_PRI_IDLE ) > - { > - if ( cpumask_empty(prv->idlers) ) > + /* > + * If there are no suitable idlers for new, and it''s higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). > + */ > + if ( cpumask_empty(&idle_mask) && new->pri > cur->pri ) > { > SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating, &cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu, &mask); > } > - else > + else if ( !cpumask_empty(&idle_mask) ) > { > - cpumask_t idle_mask; > - > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > - if ( !cpumask_empty(&idle_mask) ) > + /* Which of the idlers suitable for new shall we wake up? 
*/ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > { > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > - { > - this_cpu(last_tickle_cpu) > - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > - } > - else > - cpumask_or(&mask, &mask, &idle_mask); > + this_cpu(last_tickle_cpu) > + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > } > - cpumask_and(&mask, &mask, new->vcpu->cpu_affinity); > + else > + cpumask_or(&mask, &mask, &idle_mask); > } > } > > @@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc > { > printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight); > #ifdef CSCHED_STATS > - printk(" (%d+%u) {a/i=%u/%u m=%u+%u}", > + printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}", > svc->stats.credit_last, > svc->stats.credit_incr, > svc->stats.state_active, > svc->stats.state_idle, > svc->stats.migrate_q, > - svc->stats.migrate_r); > + svc->stats.migrate_r, > + svc->stats.kicked_away); > #endif > } > > diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h > --- a/xen/include/xen/perfc_defn.h > +++ b/xen/include/xen/perfc_defn.h > @@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable, "csc > PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable") > PERFCOUNTER(vcpu_park, "csched: vcpu_park") > PERFCOUNTER(vcpu_unpark, "csched: vcpu_unpark") > -PERFCOUNTER(tickle_local_idler, "csched: tickle_local_idler") > -PERFCOUNTER(tickle_local_over, "csched: tickle_local_over") > -PERFCOUNTER(tickle_local_under, "csched: tickle_local_under") > -PERFCOUNTER(tickle_local_other, "csched: tickle_local_other") > PERFCOUNTER(tickle_idlers_none, "csched: tickle_idlers_none") > PERFCOUNTER(tickle_idlers_some, "csched: tickle_idlers_some") > PERFCOUNTER(load_balance_idle, "csched: load_balance_idle") > @@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed, "csc > 
PERFCOUNTER(steal_peer_idle, "csched: steal_peer_idle") > PERFCOUNTER(migrate_queued, "csched: migrate_queued") > PERFCOUNTER(migrate_running, "csched: migrate_running") > +PERFCOUNTER(migrate_kicked_away, "csched: migrate_kicked_away") > PERFCOUNTER(vcpu_hot, "csched: vcpu_hot") > > PERFCOUNTER(need_flush_tlb_flush, "PG_need_flush tlb flushes")
George Dunlap
2012-Dec-14 19:39 UTC
Re: [PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
On 12/12/12 02:52, Dario Faggioli wrote:
> Defined by the previous change, using it wherever it is appropriate
> throughout sched_credit.c makes the code easier to read.

Hmm, I hadn't read this patch when I commented about removing the macro
from the first patch. :-)  I personally think that using vc->processor
is better in that patch anyway; but using this macro elsewhere is
probably fine.

I think from a taste point of view, I would have put this patch, with
the new definition, as the first patch in the series, and then had the
second patch just use it.

I'll go back and respond to Jan's comment about the macro.

 -George

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -231,7 +231,7 @@ static void burn_credits(struct csched_v
>      unsigned int credits;
>
>      /* Assert svc is current */
> -    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
> +    ASSERT( svc == CSCHED_VCPU(current_on_cpu(svc->vcpu->processor)) );
>
>      if ( (delta = now - svc->start_time) <= 0 )
>          return;
> @@ -249,8 +249,7 @@ DEFINE_PER_CPU(unsigned int, last_tickle
>  static inline void
>  __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>  {
> -    struct csched_vcpu * const cur =
> -        CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
> +    struct csched_vcpu * const cur = CSCHED_VCPU(current_on_cpu(cpu));
>      struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>      cpumask_t mask, idle_mask;
>      int idlers_empty;
> @@ -387,7 +386,7 @@ csched_alloc_pdata(const struct schedule
>      per_cpu(schedule_data, cpu).sched_priv = spc;
>
>      /* Start off idling... */
> -    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
> +    BUG_ON(!is_idle_vcpu(current_on_cpu(cpu)));
>      cpumask_set_cpu(cpu, prv->idlers);
>
>      spin_unlock_irqrestore(&prv->lock, flags);
> @@ -730,7 +729,7 @@ csched_vcpu_sleep(const struct scheduler
>
>      BUG_ON( is_idle_vcpu(vc) );
>
> -    if ( per_cpu(schedule_data, vc->processor).curr == vc )
> +    if ( current_on_cpu(vc->processor) == vc )
>          cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
>      else if ( __vcpu_on_runq(svc) )
>          __runq_remove(svc);
> @@ -744,7 +743,7 @@ csched_vcpu_wake(const struct scheduler
>
>      BUG_ON( is_idle_vcpu(vc) );
>
> -    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
> +    if ( unlikely(current_on_cpu(cpu) == vc) )
>      {
>          SCHED_STAT_CRANK(vcpu_wake_running);
>          return;
> @@ -1213,7 +1212,7 @@ static struct csched_vcpu *
>  csched_runq_steal(int peer_cpu, int cpu, int pri)
>  {
>      const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> -    const struct vcpu * const peer_vcpu = per_cpu(schedule_data, peer_cpu).curr;
> +    const struct vcpu * const peer_vcpu = current_on_cpu(peer_cpu);
>      struct csched_vcpu *speer;
>      struct list_head *iter;
>      struct vcpu *vc;
> @@ -1502,7 +1501,7 @@ csched_dump_pcpu(const struct scheduler
>      printk("core=%s\n", cpustr);
>
>      /* current VCPU */
> -    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
> +    svc = CSCHED_VCPU(current_on_cpu(cpu));
>      if ( svc )
>      {
>          printk("\trun: ");
George Dunlap
2012-Dec-14 19:50 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On 12/12/12 10:04, Jan Beulich wrote:>>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> --- a/xen/common/sched_credit.c >> +++ b/xen/common/sched_credit.c >> @@ -59,6 +59,8 @@ >> #define CSCHED_VCPU(_vcpu) ((struct csched_vcpu *) (_vcpu)->sched_priv) >> #define CSCHED_DOM(_dom) ((struct csched_dom *) (_dom)->sched_priv) >> #define RUNQ(_cpu) (&(CSCHED_PCPU(_cpu)->runq)) >> +/* Is the first element of _cpu''s runq its idle vcpu? */ >> +#define IS_RUNQ_IDLE(_cpu) (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu)) >> >> >> /* >> @@ -479,9 +481,14 @@ static int >> * distinct cores first and guarantees we don''t do something stupid >> * like run two VCPUs on co-hyperthreads while there are idle cores >> * or sockets. >> + * >> + * Notice that, when computing the "idleness" of cpu, we may want to >> + * discount vc. That is, iff vc is the currently running and the only >> + * runnable vcpu on cpu, we add cpu to the idlers. >> */ >> cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); >> - cpumask_set_cpu(cpu, &idlers); >> + if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) ) >> + cpumask_set_cpu(cpu, &idlers); >> cpumask_and(&cpus, &cpus, &idlers); >> cpumask_clear_cpu(cpu, &cpus); >> >> @@ -489,7 +496,7 @@ static int >> { >> cpumask_t cpu_idlers; >> cpumask_t nxt_idlers; >> - int nxt, weight_cpu, weight_nxt; >> + int nxt, nr_idlers_cpu, nr_idlers_nxt; >> int migrate_factor; >> >> nxt = cpumask_cycle(cpu, &cpus); >> @@ -513,12 +520,12 @@ static int >> cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); >> } >> >> - weight_cpu = cpumask_weight(&cpu_idlers); >> - weight_nxt = cpumask_weight(&nxt_idlers); >> + nr_idlers_cpu = cpumask_weight(&cpu_idlers); >> + nr_idlers_nxt = cpumask_weight(&nxt_idlers); >> /* smt_power_savings: consolidate work rather than spreading it */ >> if ( sched_smt_power_savings ? 
>> -         weight_cpu > weight_nxt :
>> -         weight_cpu * migrate_factor < weight_nxt )
>> +         nr_idlers_cpu > nr_idlers_nxt :
>> +         nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>>      {
>>          cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>>          spc = CSCHED_PCPU(nxt);
> Despite you mentioning this in the description, these last two hunks
> are, afaict, only renaming variables (and that's even debatable, as
> the current names aren't really misleading imo), and hence I don't
> think belong in a patch that clearly has the potential for causing
> (performance) regressions.
>
> That said - I don't think it will (and even more, I'm agreeable to the
> change done).
>
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>>
>> +#define current_on_cpu(_c) \
>> +    ( (per_cpu(schedule_data, _c).curr) )
>> +
> This, imo, really belongs into sched-if.h.

Hmm, it looks like there are a number of things that could live in
either sched-if.h or sched.h; but I think this one probably most closely
links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
which are in sched.h; so sched.h is where I'd put it.

> Plus - what's the point of double parentheses, when in fact none
> at all would be needed?
>
> And finally, why "_c" and not just "c"?

I think the underscore is pretty standard in macros.  There's certainly
no need for double parentheses though.

 -George
George Dunlap
2012-Dec-14 19:57 UTC
Re: [PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
On 12/12/12 02:52, Dario Faggioli wrote:
> When looking at traces, it turns out to be useful to know where a
> waking-up VCPU is being queued. Yes, that is always the CPU where
> it ran last, but that information can well be lost in past trace
> records!

When you say "lost in past trace records", do you primarily mean that
the records themselves have been lost (due to the per-cpu trace buffers
filling up), or do you mean that it may be way way back and you don't
want to go back and find it?

If the latter, I think the best thing to do would be to just augment
xenalyze to keep track of that information and print it when it sees
the wake record.

 -George
George Dunlap
2012-Dec-14 20:00 UTC
Re: [PATCH 5 of 6 v2] xen: tracing: introduce per-scheduler trace event IDs
On 12/12/12 02:52, Dario Faggioli wrote:
> So that it becomes possible to create scheduler specific trace records,
> within each scheduler, without worrying about overlapping, and also
> without giving up being able to recognise them univocally. The latter
> is deemed as useful, since we can have more than one scheduler running
> at the same time, thanks to cpupools.
>
> The event ID is 12 bits, and this change uses the upper 3 of them for
> the 'scheduler ID'. This means we're limited to 8 schedulers and to
> 512 scheduler specific tracing events. Both seem reasonable limitations
> as of now.
>
> This also converts the existing credit2 tracing (the only scheduler
> generating tracing events up to now) to the new system.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
> Changes from v1:
>  * The event ID generation macro is now called `TRC_SCHED_CLASS_EVT()',
>    and has been generalized and put in trace.h, as suggested.
>  * The handling of per-scheduler tracing IDs and masks has been
>    restructured, properly naming "ID" the numerical identifiers
>    and "MASK" the bitmasks, as requested.
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -29,18 +29,22 @@
>  #define d2printk(x...)
> //#define d2printk printk > > -#define TRC_CSCHED2_TICK TRC_SCHED_CLASS + 1 > -#define TRC_CSCHED2_RUNQ_POS TRC_SCHED_CLASS + 2 > -#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS + 3 > -#define TRC_CSCHED2_CREDIT_ADD TRC_SCHED_CLASS + 4 > -#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS + 5 > -#define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS + 6 > -#define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS + 7 > -#define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS + 8 > -#define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS + 9 > -#define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS + 10 > -#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS + 11 > -#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS + 12 > +/* > + * Credit2 tracing events ("only" 512 available!). Check > + * include/public/trace.h for more details. > + */ > +#define TRC_CSCHED2_TICK TRC_SCHED_CLASS_EVT(CSCHED2, 1) > +#define TRC_CSCHED2_RUNQ_POS TRC_SCHED_CLASS_EVT(CSCHED2, 2) > +#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS_EVT(CSCHED2, 3) > +#define TRC_CSCHED2_CREDIT_ADD TRC_SCHED_CLASS_EVT(CSCHED2, 4) > +#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS_EVT(CSCHED2, 5) > +#define TRC_CSCHED2_TICKLE TRC_SCHED_CLASS_EVT(CSCHED2, 6) > +#define TRC_CSCHED2_CREDIT_RESET TRC_SCHED_CLASS_EVT(CSCHED2, 7) > +#define TRC_CSCHED2_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED2, 8) > +#define TRC_CSCHED2_UPDATE_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 9) > +#define TRC_CSCHED2_RUNQ_ASSIGN TRC_SCHED_CLASS_EVT(CSCHED2, 10) > +#define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11) > +#define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12) > > /* > * WARNING: This is still in an experimental phase. 
Status and work can be found at the > diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h > --- a/xen/include/public/trace.h > +++ b/xen/include/public/trace.h > @@ -57,6 +57,32 @@ > #define TRC_SCHED_CLASS 0x00022000 /* Scheduler-specific */ > #define TRC_SCHED_VERBOSE 0x00028000 /* More inclusive scheduling */ > > +/* > + * The highest 3 bits of the last 12 bits of TRC_SCHED_CLASS above are > + * reserved for encoding what scheduler produced the information. The > + * actual event is encoded in the last 9 bits. > + * > + * This means we have 8 scheduling IDs available (which means at most 8 > + * schedulers generating events) and, in each scheduler, up to 512 > + * different events. > + */ > +#define TRC_SCHED_ID_BITS 3 > +#define TRC_SCHED_ID_SHIFT (TRC_SUBCLS_SHIFT - TRC_SCHED_ID_BITS) > +#define TRC_SCHED_ID_MASK (((1UL<<TRC_SCHED_ID_BITS) - 1) << TRC_SCHED_ID_SHIFT) > +#define TRC_SCHED_EVT_MASK (~(TRC_SCHED_ID_MASK)) > + > +/* Per-scheduler IDs, to identify scheduler specific events */ > +#define TRC_SCHED_CSCHED 0 > +#define TRC_SCHED_CSCHED2 1 > +#define TRC_SCHED_SEDF 2 > +#define TRC_SCHED_ARINC653 3 > + > +/* Per-scheduler tracing */ > +#define TRC_SCHED_CLASS_EVT(_c, _e) \ > + ( ( TRC_SCHED_CLASS | \ > + ((TRC_SCHED_##_c << TRC_SCHED_ID_SHIFT) & TRC_SCHED_ID_MASK) ) + \ > + (_e & TRC_SCHED_EVT_MASK) ) > + > /* Trace classes for Hardware */ > #define TRC_HW_PM 0x00801000 /* Power management traces */ > #define TRC_HW_IRQ 0x00802000 /* Traces relating to the handling of IRQs */
George Dunlap
2012-Dec-14 20:05 UTC
Re: [PATCH 6 of 6 v2] xen: sched_credit: add some tracing
On 12/12/12 02:52, Dario Faggioli wrote:> About tickling, and PCPU selection. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > --- > Changes from v1: > * Dummy `struct d {}'', accommodating `cpu'' only, removed > in spite of the much more readable `trace_var(..., sizeof(cpu), &cpu)'', > as suggested. > > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -21,6 +21,7 @@ > #include <asm/atomic.h> > #include <xen/errno.h> > #include <xen/keyhandler.h> > +#include <xen/trace.h> > > > /* > @@ -97,6 +98,18 @@ > > > /* > + * Credit tracing events ("only" 512 available!). Check > + * include/public/trace.h for more details. > + */ > +#define TRC_CSCHED_SCHED_TASKLET TRC_SCHED_CLASS_EVT(CSCHED, 1) > +#define TRC_CSCHED_ACCOUNT_START TRC_SCHED_CLASS_EVT(CSCHED, 2) > +#define TRC_CSCHED_ACCOUNT_STOP TRC_SCHED_CLASS_EVT(CSCHED, 3) > +#define TRC_CSCHED_STOLEN_VCPU TRC_SCHED_CLASS_EVT(CSCHED, 4) > +#define TRC_CSCHED_PICKED_CPU TRC_SCHED_CLASS_EVT(CSCHED, 5) > +#define TRC_CSCHED_TICKLE TRC_SCHED_CLASS_EVT(CSCHED, 6) > + > + > +/* > * Boot parameters > */ > static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS; > @@ -315,9 +328,18 @@ static inline void > } > } > > - /* Send scheduler interrupts to designated CPUs */ > if ( !cpumask_empty(&mask) ) > + { > + if ( unlikely(tb_init_done) ) > + { > + /* Avoid TRACE_*: saves checking !tb_init_done each step */ > + for_each_cpu(cpu, &mask) > + trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu); > + }Hmm, probably should have pointed this out before, but trace_var() is a static inline which checks tb_init_done -- you want __trace_var(). :-) The rest still looks good. 
-George> + > + /* Send scheduler interrupts to designated CPUs */ > cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ); > + } > } > > static void > @@ -554,6 +576,8 @@ static int > if ( commit && spc ) > spc->idle_bias = cpu; > > + TRACE_3D(TRC_CSCHED_PICKED_CPU, vc->domain->domain_id, vc->vcpu_id, cpu); > + > return cpu; > } > > @@ -586,6 +610,9 @@ static inline void > } > } > > + TRACE_3D(TRC_CSCHED_ACCOUNT_START, sdom->dom->domain_id, > + svc->vcpu->vcpu_id, sdom->active_vcpu_count); > + > spin_unlock_irqrestore(&prv->lock, flags); > } > > @@ -608,6 +635,9 @@ static inline void > { > list_del_init(&sdom->active_sdom_elem); > } > + > + TRACE_3D(TRC_CSCHED_ACCOUNT_STOP, sdom->dom->domain_id, > + svc->vcpu->vcpu_id, sdom->active_vcpu_count); > } > > static void > @@ -1241,6 +1271,8 @@ csched_runq_steal(int peer_cpu, int cpu, > if (__csched_vcpu_is_migrateable(vc, cpu)) > { > /* We got a candidate. Grab it! */ > + TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, > + vc->domain->domain_id, vc->vcpu_id); > SCHED_VCPU_STAT_CRANK(speer, migrate_q); > SCHED_STAT_CRANK(migrate_queued); > WARN_ON(vc->is_urgent); > @@ -1401,6 +1433,7 @@ csched_schedule( > /* Tasklet work (which runs in idle VCPU context) overrides all else. */ > if ( tasklet_work_scheduled ) > { > + TRACE_0D(TRC_CSCHED_SCHED_TASKLET); > snext = CSCHED_VCPU(idle_vcpu[cpu]); > snext->pri = CSCHED_PRI_TS_BOOST; > }
Jan Beulich
2012-Dec-17 08:35 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
Jan Beulich
2012-Dec-17 08:35 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
>>> On 14.12.12 at 20:50, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/12/12 10:04, Jan Beulich wrote:
>>>>> On 12.12.12 at 03:52, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> --- a/xen/common/sched_credit.c
>>> +++ b/xen/common/sched_credit.c
>>> @@ -59,6 +59,8 @@
>>>  #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
>>>  #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
>>>  #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
>>> +/* Is the first element of _cpu's runq its idle vcpu? */
>>> +#define IS_RUNQ_IDLE(_cpu)  (is_idle_vcpu(__runq_elem(RUNQ(_cpu)->next)->vcpu))
>>>
>>>
>>>  /*
>>> @@ -479,9 +481,14 @@ static int
>>>       * distinct cores first and guarantees we don't do something stupid
>>>       * like run two VCPUs on co-hyperthreads while there are idle cores
>>>       * or sockets.
>>> +     *
>>> +     * Notice that, when computing the "idleness" of cpu, we may want to
>>> +     * discount vc. That is, iff vc is the currently running and the only
>>> +     * runnable vcpu on cpu, we add cpu to the idlers.
>>>       */
>>>      cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    cpumask_set_cpu(cpu, &idlers);
>>> +    if ( current_on_cpu(cpu) == vc && IS_RUNQ_IDLE(cpu) )
>>> +        cpumask_set_cpu(cpu, &idlers);
>>>      cpumask_and(&cpus, &cpus, &idlers);
>>>      cpumask_clear_cpu(cpu, &cpus);
>>>
>>> @@ -489,7 +496,7 @@ static int
>>>      {
>>>          cpumask_t cpu_idlers;
>>>          cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> +        int nxt, nr_idlers_cpu, nr_idlers_nxt;
>>>          int migrate_factor;
>>>
>>>          nxt = cpumask_cycle(cpu, &cpus);
>>> @@ -513,12 +520,12 @@ static int
>>>              cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
>>>          }
>>>
>>> -        weight_cpu = cpumask_weight(&cpu_idlers);
>>> -        weight_nxt = cpumask_weight(&nxt_idlers);
>>> +        nr_idlers_cpu = cpumask_weight(&cpu_idlers);
>>> +        nr_idlers_nxt = cpumask_weight(&nxt_idlers);
>>>          /* smt_power_savings: consolidate work rather than spreading it */
>>>          if ( sched_smt_power_savings ?
>>> -             weight_cpu > weight_nxt :
>>> -             weight_cpu * migrate_factor < weight_nxt )
>>> +             nr_idlers_cpu > nr_idlers_nxt :
>>> +             nr_idlers_cpu * migrate_factor < nr_idlers_nxt )
>>>          {
>>>              cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
>>>              spc = CSCHED_PCPU(nxt);
>>
>> Despite you mentioning this in the description, these last two hunks
>> are, afaict, only renaming variables (and that's even debatable, as
>> the current names aren't really misleading imo), and hence I don't
>> think belong in a patch that clearly has the potential for causing
>> (performance) regressions.
>>
>> That said - I don't think it will (and even more, I'm agreeable to the
>> change done).
>>
>>> --- a/xen/include/xen/sched.h
>>> +++ b/xen/include/xen/sched.h
>>> @@ -396,6 +396,9 @@ extern struct vcpu *idle_vcpu[NR_CPUS];
>>>  #define is_idle_domain(d) ((d)->domain_id == DOMID_IDLE)
>>>  #define is_idle_vcpu(v)   (is_idle_domain((v)->domain))
>>>
>>> +#define current_on_cpu(_c) \
>>> +    ( (per_cpu(schedule_data, _c).curr) )
>>> +
>> This, imo, really belongs into sched-if.h.
>
> Hmm, it looks like there are a number of things that could live in
> either sched-if.h or sched.h; but I think this one probably most closely
> links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
> which are in sched.h; so sched.h is where I'd put it.

Any use of schedule_data, the type of which is declared in
sched-if.h, should be in sched-if.h - someone only including
sched.h can't make use of it anyway (and it's intended to be
used by scheduler code, i.e. shouldn't be visible to other
code).

>> Plus - what's the point of double parentheses, when in fact none
>> at all would be needed?
>>
>> And finally, why "_c" and not just "c"?
>
> I think the underscore is pretty standard in macros.

It's bad practice imo; I have always understood this as questionable
attempts of people to avoid name clashes (which is understandable
only for variables declared locally inside a macro definition).

Jan
Dario Faggioli
2012-Dec-17 14:36 UTC
Re: [PATCH 1 of 6 v2] xen: sched_credit: improve picking up the idlal CPU for a VCPU
On Mon, 2012-12-17 at 08:35 +0000, Jan Beulich wrote:
> >>> On 14.12.12 at 20:50, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > On 12/12/12 10:04, Jan Beulich wrote:
> >> This, imo, really belongs into sched-if.h.
> >
> > Hmm, it looks like there are a number of things that could live in
> > either sched-if.h or sched.h; but I think this one probably most closely
> > links with things like vcpu_is_runnable() and cpu_is_haltable(), both of
> > which are in sched.h; so sched.h is where I'd put it.
>
> Any use of schedule_data, the type of which is declared in
> sched-if.h, should be in sched-if.h - someone only including
> sched.h can't make use of it anyway (and it's intended to be
> used by scheduler code, i.e. shouldn't be visible to other
> code).
>
Ok, I find this argument quite convincing; I'll put the macro in
sched-if.h.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2012-Dec-17 14:41 UTC
Re: [PATCH 3 of 6 v2] xen: sched_credit: use current_on_cpu() when appropriate
On Fri, 2012-12-14 at 19:39 +0000, George Dunlap wrote:
> Hmm, I hadn't read this patch when I commented about removing the macro
> from the first patch. :-)
>
:-)

> I personally think that using vc->processor is better in that patch
> anyway; but using this macro elsewhere is probably fine.
>
Ok.

> I think from a taste point of view, I would have put this patch, with
> the new definition, as the first patch in the series, and then had the
> second patch just use it.
>
Ok then, when respinning I'll have the first patch defining and using the
macro. Then I'll have the 'fix picking' patch using vc->processor
_instead of_ the macro. Yes, that kills the need for the macro, but since
I already have the code for it, and I think things do look a bit better
with it, I won't actually kill it. Let me know (here or replying to the
corresponding e-mail in the reposting) if you don't want it.

Thanks and Regards,
Dario
Dario Faggioli
2012-Dec-17 14:43 UTC
Re: [PATCH 4 of 6 v2] xen: tracing: report where a VCPU wakes up
On Fri, 2012-12-14 at 19:57 +0000, George Dunlap wrote:
> On 12/12/12 02:52, Dario Faggioli wrote:
> > When looking at traces, it turns out to be useful to know where a
> > waking-up VCPU is being queued. Yes, that is always the CPU where
> > it ran last, but that information can well be lost in past trace
> > records!
>
> When you say "lost in past trace records", do you primarily mean that
> the records themselves have been lost (due to the per-cpu trace buffers
> filling up), or do you mean that it may be way way back and you don't
> want to go back and find it?
>
The latter... I'm quite lazy when looking at traces! :-P

> If the latter, I think the best thing to do would be to just augment
> xenalyze to keep track of that information and print it when it sees the
> wake record.
>
I agree. I'll kill this patch for now, and investigate further solutions
along the lines you suggested in the future.

Thanks and Regards,
Dario
Dario Faggioli
2012-Dec-17 14:45 UTC
Re: [PATCH 6 of 6 v2] xen: sched_credit: add some tracing
On Fri, 2012-12-14 at 20:05 +0000, George Dunlap wrote:
> > -    /* Send scheduler interrupts to designated CPUs */
> >      if ( !cpumask_empty(&mask) )
> > +    {
> > +        if ( unlikely(tb_init_done) )
> > +        {
> > +            /* Avoid TRACE_*: saves checking !tb_init_done each step */
> > +            for_each_cpu(cpu, &mask)
> > +                trace_var(TRC_CSCHED_TICKLE, 0, sizeof(cpu), &cpu);
> > +        }
>
> Hmm, probably should have pointed this out before, but trace_var() is a
> static inline which checks tb_init_done -- you want __trace_var(). :-)
>
Correct, my bad. It actually was __trace_var() at the beginning (that's
the reason why this particular record deserves special treatment), but I
messed things up while preparing this new version. Sorry for that, and
thanks for having spotted this! :-)

Will fix.

Regards,
Dario