Dario Faggioli
2013-Sep-13 16:09 UTC
[PATCH] xen: sched_credit: filter node-affinity mask against online cpus
in _csched_cpu_pick(), as not doing so may result in the domain's
node-affinity mask (as retrieved by csched_balance_cpumask() )
and online mask (as retrieved by cpupool_scheduler_cpumask() )
having an empty intersection.

Therefore, when attempting a node-affinity load balancing step
and running this:

    ...
    /* Pick an online CPU from the proper affinity mask */
    csched_balance_cpumask(vc, balance_step, &cpus);
    cpumask_and(&cpus, &cpus, online);
    ...

we end up with an empty cpumask (in cpus). At this point, in
the following code:

    ....
    /* If present, prefer vc's current processor */
    cpu = cpumask_test_cpu(vc->processor, &cpus)
            ? vc->processor
            : cpumask_cycle(vc->processor, &cpus);
    ....

an ASSERT (from inside cpumask_cycle() ) triggers like this:

(XEN) Xen call trace:
(XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
(XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
(XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
(XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
(XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
(XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
(XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
(XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
(XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481

It is for example sufficient to have a domain with node-affinity
to NUMA node 1 running, and issuing a `xl cpupool-numa-split'
would make the above happen. That is because, by default, all
the existing domains remain assigned to the first cpupool, and
it now (after the cpupool-numa-split) only includes NUMA node 0.

This change prevents that by generalizing the function used
for figuring out whether a node-affinity load balancing step
is legit or not. This way we can, in _csched_cpu_pick(),
figure out early enough that the mask would end up empty,
skip the step altogether and avoid the splat.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
---
This was originally submitted (yesterday) as "xen: make sure the node-affinity
is always updated". After discussing it with George, I got convinced that a fix
like this is more appropriate, since it deals with the issue in the scheduler,
without making a mess out of the user interface.

This is therefore not being posted as a real v2 of that patch because the
approach radically changed.
---
 xen/common/sched_credit.c |   41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index dbe6de6..d76cd88 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -296,15 +296,25 @@ static void csched_set_node_affinity(
  * vcpu-affinity balancing is always necessary and must never be skipped.
  * OTOH, if a domain's node-affinity is said to be automatically computed
  * (or if it just spans all the nodes), we can safely avoid dealing with
- * node-affinity entirely. Ah, node-affinity is also deemed meaningless
- * in case it has empty intersection with the vcpu's vcpu-affinity, as it
- * would mean trying to schedule it on _no_ pcpu!
+ * node-affinity entirely.
+ *
+ * Node-affinity is also deemed meaningless in case it has empty
+ * intersection with mask, to cover the cases where using the node-affinity
+ * mask seems legit, but would instead led to trying to schedule the vcpu
+ * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
+ * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
+ * in the domain's cpupool.
  */
-#define __vcpu_has_node_affinity(vc) \
-    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || !cpumask_intersects(vc->cpu_affinity, \
-                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || vc->domain->auto_node_affinity == 1) )
+static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
+{
+    if ( vc->domain->auto_node_affinity == 1
+         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
+         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
+                                mask) )
+        return 0;
+
+    return 1;
+}

 /*
  * Each csched-balance step uses its own cpumask. This function determines
@@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
         int new_idlers_empty;

         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(new->vcpu) )
+             && !__vcpu_has_node_affinity(new->vcpu,
+                                          new->vcpu->cpu_affinity) )
             continue;

         /* Are there idlers suitable for new (for this balance step)? */
@@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
     int balance_step;

     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, vc->cpu_affinity, online);
+
     for_each_csched_balance_step( balance_step )
     {
+        /*
+         * We filter the node-affinity mask against
+         * [vc->cpu_affinity && online] here to avoid problems if all
+         * the cpus in in the node-affinity mask are offline (e.g.,
+         * because the domain moved to a different cpupool).
+         */
         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(vc) )
+             && !__vcpu_has_node_affinity(vc, &cpus) )
             continue;

         /* Pick an online CPU from the proper affinity mask */
@@ -1445,7 +1464,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * or counter.
          */
         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(vc) )
+             && !__vcpu_has_node_affinity(vc, vc->cpu_affinity) )
             continue;

         csched_balance_cpumask(vc, balance_step, csched_balance_mask);
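To make the failure mode above concrete, here is a minimal standalone C sketch of the logic being discussed. It is not Xen code: plain uint8_t bitmasks stand in for cpumask_t, all function names are made up for illustration, and cpus 0-3 / 4-7 model NUMA node 0 / node 1, i.e. the post-`xl cpupool-numa-split' scenario from the changelog. It only shows why checking node-affinity against [vcpu-affinity && online], rather than against vcpu-affinity alone, lets the balance step be skipped before an empty mask can ever reach a cpumask_cycle()-style helper.

/*
 * Standalone model of the bug and the fix (NOT Xen code: uint8_t bitmasks
 * stand in for cpumask_t and all names below are hypothetical).
 * CPUs 0-3 model NUMA node 0, CPUs 4-7 model NUMA node 1.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t mask_t;

/* Pick a set bit; mirrors cpumask_cycle()'s "mask must not be empty" rule. */
static int cycle(mask_t m)
{
    assert(m != 0);               /* this is the ASSERT that fired in the report */
    for (int c = 0; c < 8; c++)
        if (m & (1u << c))
            return c;
    return -1;                    /* unreachable */
}

/* Old-style check: node-affinity filtered only against the vcpu-affinity. */
static int has_node_affinity_old(mask_t node_aff, mask_t cpu_aff)
{
    return (node_aff & cpu_aff) != 0;
}

/* New-style check: the caller passes whatever mask the result will be ANDed with. */
static int has_node_affinity_new(mask_t node_aff, mask_t mask)
{
    return (node_aff & mask) != 0;
}

int main(void)
{
    mask_t node_aff = 0xF0;       /* domain's node-affinity: node 1 only        */
    mask_t cpu_aff  = 0xFF;       /* vcpu-affinity: all cpus                    */
    mask_t online   = 0x0F;       /* cpupool after cpupool-numa-split: node 0   */

    /* Fixed path: filter against (cpu_aff & online) and skip the step early. */
    if (has_node_affinity_new(node_aff, cpu_aff & online))
        printf("node-affinity step: cpu %d\n", cycle(node_aff & cpu_aff & online));
    else
        printf("node-affinity step skipped, falling back to vcpu-affinity\n");

    /* Buggy path: the old check passes, then cycle() asserts on an empty mask. */
    if (has_node_affinity_old(node_aff, cpu_aff))
        printf("cpu %d\n", cycle(node_aff & cpu_aff & online));  /* mask == 0 -> assert */

    return 0;
}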
George Dunlap
2013-Sep-16 14:00 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On 13/09/13 17:09, Dario Faggioli wrote:
> in _csched_cpu_pick(), as not doing so may result in the domain's
> node-affinity mask (as retrieved by csched_balance_cpumask() )
> and online mask (as retrieved by cpupool_scheduler_cpumask() )
> having an empty intersection.
>
> Therefore, when attempting a node-affinity load balancing step
> and running this:
>
>     ...
>     /* Pick an online CPU from the proper affinity mask */
>     csched_balance_cpumask(vc, balance_step, &cpus);
>     cpumask_and(&cpus, &cpus, online);
>     ...
>
> we end up with an empty cpumask (in cpus). At this point, in
> the following code:
>
>     ....
>     /* If present, prefer vc's current processor */
>     cpu = cpumask_test_cpu(vc->processor, &cpus)
>             ? vc->processor
>             : cpumask_cycle(vc->processor, &cpus);
>     ....
>
> an ASSERT (from inside cpumask_cycle() ) triggers like this:
>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
> (XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
> (XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
> (XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
> (XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
> (XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
> (XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
> (XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
> (XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481
>
> It is for example sufficient to have a domain with node-affinity
> to NUMA node 1 running, and issuing a `xl cpupool-numa-split'
> would make the above happen. That is because, by default, all
> the existing domains remain assigned to the first cpupool, and
> it now (after the cpupool-numa-split) only includes NUMA node 0.
>
> This change prevents that by generalizing the function used
> for figuring out whether a node-affinity load balancing step
> is legit or not. This way we can, in _csched_cpu_pick(),
> figure out early enough that the mask would end up empty,
> skip the step altogether and avoid the splat.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Keir Fraser <keir@xen.org>
> ---
> This was originally submitted (yesterday) as "xen: make sure the node-affinity
> is always updated". After discussing it with George, I got convinced that a fix
> like this is more appropriate, since it deals with the issue in the scheduler,
> without making a mess out of the user interface.
>
> This is therefore not being posted as a real v2 of that patch because the
> approach radically changed.
> ---
>  xen/common/sched_credit.c |   41 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index dbe6de6..d76cd88 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -296,15 +296,25 @@ static void csched_set_node_affinity(
>   * vcpu-affinity balancing is always necessary and must never be skipped.
>   * OTOH, if a domain's node-affinity is said to be automatically computed
>   * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely. Ah, node-affinity is also deemed meaningless
> - * in case it has empty intersection with the vcpu's vcpu-affinity, as it
> - * would mean trying to schedule it on _no_ pcpu!
> + * node-affinity entirely.
> + *
> + * Node-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the node-affinity
> + * mask seems legit, but would instead led to trying to schedule the vcpu
> + * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
> + * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> + * in the domain's cpupool.
>   */
> -#define __vcpu_has_node_affinity(vc) \
> -    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || !cpumask_intersects(vc->cpu_affinity, \
> -                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || vc->domain->auto_node_affinity == 1) )
> +static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
> +{
> +    if ( vc->domain->auto_node_affinity == 1
> +         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
> +         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
> +                                mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
>  /*
>   * Each csched-balance step uses its own cpumask. This function determines
> @@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>          int new_idlers_empty;
>
>          if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> -             && !__vcpu_has_node_affinity(new->vcpu) )
> +             && !__vcpu_has_node_affinity(new->vcpu,
> +                                          new->vcpu->cpu_affinity) )
>              continue;
>
>          /* Are there idlers suitable for new (for this balance step)? */
> @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>      int balance_step;
>
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, vc->cpu_affinity, online);
> +
>      for_each_csched_balance_step( balance_step )
>      {
> +        /*
> +         * We filter the node-affinity mask against
> +         * [vc->cpu_affinity && online] here to avoid problems if all
> +         * the cpus in in the node-affinity mask are offline (e.g.,
> +         * because the domain moved to a different cpupool).
> +         */

It's probably worth mentioning that the reason you do the cpumask_and
here and not in the other place this is called is that cpumask_cycle is
called after this one (which will choke on an empty mask), but not in the
other place it's called.

Other than that, looks good -- thanks.

 -George
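A schematic way to picture the asymmetry George describes (again with made-up helpers on plain bitmasks, not the actual credit scheduler code): a call site that goes on to cycle through the resulting mask has to rule out emptiness up front, whereas a call site that only tests the intersection can treat an empty result as "nothing to do" for that balance step.

/*
 * Schematic illustration only; the helper names and callers are hypothetical
 * stand-ins for the real code paths discussed in this thread.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t mask_t;

/* Like cpumask_cycle(): the caller must guarantee a non-empty mask. */
static int cycle(mask_t m)
{
    assert(m != 0);
    for (int c = 0; c < 8; c++)
        if (m & (1u << c))
            return c;
    return -1; /* unreachable */
}

/* Caller A (think _csched_cpu_pick): the per-step mask is cycled through,
 * so an empty intersection has to be ruled out before we get here. */
static int pick_cpu(mask_t balance_mask, mask_t online)
{
    return cycle(balance_mask & online);
}

/* Caller B (think __runq_tickle / csched_runq_steal): an empty intersection
 * is harmless, it simply means "nothing to do" for this balance step. */
static int step_has_work(mask_t balance_mask, mask_t idlers)
{
    return (balance_mask & idlers) != 0;
}

int main(void)
{
    mask_t node_aff = 0xF0, online = 0x0F, idlers = 0x0F;

    /* Empty intersection is fine for caller B: it just reports 0. */
    printf("caller B with empty intersection: %d (harmless)\n",
           step_has_work(node_aff, idlers));

    /* The same empty intersection trips the assertion in caller A. */
    printf("caller A: %d\n", pick_cpu(node_aff, online));
    return 0;
}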
Dario Faggioli
2013-Sep-16 15:23 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On lun, 2013-09-16 at 15:00 +0100, George Dunlap wrote:
> On 13/09/13 17:09, Dario Faggioli wrote:
>
> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > index dbe6de6..d76cd88 100644
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
>
> > @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
> >      int balance_step;
> >
> >      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > +    cpumask_and(&cpus, vc->cpu_affinity, online);
> > +
> >      for_each_csched_balance_step( balance_step )
> >      {
> > +        /*
> > +         * We filter the node-affinity mask against
> > +         * [vc->cpu_affinity && online] here to avoid problems if all
> > +         * the cpus in in the node-affinity mask are offline (e.g.,
> > +         * because the domain moved to a different cpupool).
> > +         */
>
> It's probably worth mentioning that the reason you do the cpumask_and
> here and not in the other place this is called is that cpumask_cycle is
> called after this one (which will choke on an empty mask), but not in the
> other place it's called.
>
Well, it is indeed what the comment says, or at least what I wanted it to
say. :-)

I tried to concentrate on the actual issue, which is avoiding picking an
offline cpu, independently of which specific ASSERT is the one that
triggers.

Anyway, I can try to make the comment more accurate and descriptive
though.

> Other than that, looks good -- thanks.
>
Ok, I'll resend with an updated comment.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2013-Sep-16 15:27 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On 16/09/13 16:23, Dario Faggioli wrote:
> On lun, 2013-09-16 at 15:00 +0100, George Dunlap wrote:
>> On 13/09/13 17:09, Dario Faggioli wrote:
>>
>>> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
>>> index dbe6de6..d76cd88 100644
>>> --- a/xen/common/sched_credit.c
>>> +++ b/xen/common/sched_credit.c
>>> @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>>>      int balance_step;
>>>
>>>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> +    cpumask_and(&cpus, vc->cpu_affinity, online);
>>> +
>>>      for_each_csched_balance_step( balance_step )
>>>      {
>>> +        /*
>>> +         * We filter the node-affinity mask against
>>> +         * [vc->cpu_affinity && online] here to avoid problems if all
>>> +         * the cpus in in the node-affinity mask are offline (e.g.,
>>> +         * because the domain moved to a different cpupool).
>>> +         */
>> It's probably worth mentioning that the reason you do the cpumask_and
>> here and not in the other place this is called is that cpumask_cycle is
>> called after this one (which will choke on an empty mask), but not in the
>> other place it's called.
>>
> Well, it is indeed what the comment says, or at least what I wanted it to
> say. :-)
>
> I tried to concentrate on the actual issue, which is avoiding picking an
> offline cpu, independently of which specific ASSERT is the one that
> triggers.

Ok, but the question is, why AND it with online cpus here, but not in
the other place? If the "actual issue" is that the mask may be empty,
it should be done in both places. If the "actual issue" is that
cpumask_cycle() chokes on empty masks, then that's an important thing to
point out, so poor programmers don't get confused as to why it's
different. :-)

 -George
Dario Faggioli
2013-Sep-17 15:14 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On lun, 2013-09-16 at 16:27 +0100, George Dunlap wrote:
> On 16/09/13 16:23, Dario Faggioli wrote:
> > I tried to concentrate on the actual issue, which is avoiding picking an
> > offline cpu, independently of which specific ASSERT is the one that
> > triggers.
>
> Ok, but the question is, why AND it with online cpus here, but not in
> the other place? If the "actual issue" is that the mask may be empty,
> it should be done in both places. If the "actual issue" is that
> cpumask_cycle() chokes on empty masks, then that's an important thing to
> point out, so poor programmers don't get confused as to why it's
> different. :-)
>
It's a little bit of both. In fact, _csched_cpu_pick() *is* the only
function structured in such a way that the mask can't be empty (if we
leave node-affinity out). The other ones either don't care or account
for it in different ways and places, and that was like that even before
my NUMA-aware scheduling patches... So, yes, that one call site is, if
not special, at least different from the other ones.

At the same time, you are right in saying that the consequence of the
above that is really bothersome and calls for this fix _is_ indeed that
the ASSERT() fails.

So, I'm re-sending the patch with an updated comment which hopefully
clarifies the situation a bit more. Let me know what you think about it
(e-mail coming right after this one).

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)