Dario Faggioli
2013-Sep-13 16:09 UTC
[PATCH] xen: sched_credit: filter node-affinity mask against online cpus
in _csched_cpu_pick(), as not doing so may result in the domain's
node-affinity mask (as retrieved by csched_balance_cpumask() )
and online mask (as retrieved by cpupool_scheduler_cpumask() )
having an empty intersection.

Therefore, when attempting a node-affinity load balancing step
and running this:

    ...
    /* Pick an online CPU from the proper affinity mask */
    csched_balance_cpumask(vc, balance_step, &cpus);
    cpumask_and(&cpus, &cpus, online);
    ...

we end up with an empty cpumask (in cpus). At this point, in
the following code:

    ....
    /* If present, prefer vc's current processor */
    cpu = cpumask_test_cpu(vc->processor, &cpus)
            ? vc->processor
            : cpumask_cycle(vc->processor, &cpus);
    ....

an ASSERT (from inside cpumask_cycle() ) triggers like this:

(XEN) Xen call trace:
(XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
(XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
(XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
(XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
(XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
(XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
(XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
(XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
(XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481

It is for example sufficient to have a domain with node-affinity
to NUMA node 1 running, and issuing a `xl cpupool-numa-split'
would make the above happen. That is because, by default, all
the existing domains remain assigned to the first cpupool, and
it now (after the cpupool-numa-split) only includes NUMA node 0.

This change prevents that by generalizing the function used
for figuring out whether a node-affinity load balancing step
is legit or not. This way we can, in _csched_cpu_pick(),
figure out early enough that the mask would end up empty,
skip the step altogether and avoid the splat.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
---
This was originally submitted (yesterday) as "xen: make sure the node-affinity
is always updated". After discussing it with George, I got convinced that a fix
like this is more appropriate, since it deals with the issue in the scheduler,
without making a mess out of the user interface.

This is therefore not being posted as a real v2 of that patch because the
approach radically changed.
---
 xen/common/sched_credit.c |   41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index dbe6de6..d76cd88 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -296,15 +296,25 @@ static void csched_set_node_affinity(
  * vcpu-affinity balancing is always necessary and must never be skipped.
  * OTOH, if a domain's node-affinity is said to be automatically computed
  * (or if it just spans all the nodes), we can safely avoid dealing with
- * node-affinity entirely. Ah, node-affinity is also deemed meaningless
- * in case it has empty intersection with the vcpu's vcpu-affinity, as it
- * would mean trying to schedule it on _no_ pcpu!
+ * node-affinity entirely.
+ *
+ * Node-affinity is also deemed meaningless in case it has empty
+ * intersection with mask, to cover the cases where using the node-affinity
+ * mask seems legit, but would instead led to trying to schedule the vcpu
+ * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
+ * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
+ * in the domain's cpupool.
  */
-#define __vcpu_has_node_affinity(vc) \
-    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || !cpumask_intersects(vc->cpu_affinity, \
-                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || vc->domain->auto_node_affinity == 1) )
+static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
+{
+    if ( vc->domain->auto_node_affinity == 1
+         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
+         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
+                                mask) )
+        return 0;
+
+    return 1;
+}

 /*
  * Each csched-balance step uses its own cpumask. This function determines
@@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
         int new_idlers_empty;

         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(new->vcpu) )
+             && !__vcpu_has_node_affinity(new->vcpu,
+                                          new->vcpu->cpu_affinity) )
             continue;

         /* Are there idlers suitable for new (for this balance step)? */
@@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
     int balance_step;

     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, vc->cpu_affinity, online);
+
     for_each_csched_balance_step( balance_step )
     {
+        /*
+         * We filter the node-affinity mask against
+         * [vc->cpu_affinity && online] here to avoid problems if all
+         * the cpus in in the node-affinity mask are offline (e.g.,
+         * because the domain moved to a different cpupool).
+         */
         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(vc) )
+             && !__vcpu_has_node_affinity(vc, &cpus) )
             continue;

         /* Pick an online CPU from the proper affinity mask */
@@ -1445,7 +1464,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * or counter.
          */
         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(vc) )
+             && !__vcpu_has_node_affinity(vc, vc->cpu_affinity) )
             continue;

         csched_balance_cpumask(vc, balance_step, csched_balance_mask);
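To make the failure mode above concrete, here is a minimal standalone C sketch of the logic being discussed. It is not Xen code: plain uint8_t bitmasks stand in for cpumask_t, all function names are made up for illustration, and cpus 0-3 / 4-7 model NUMA node 0 / node 1, i.e. the post-`xl cpupool-numa-split' scenario from the changelog. It only shows why checking node-affinity against [vcpu-affinity && online], rather than against vcpu-affinity alone, lets the balance step be skipped before an empty mask can ever reach a cpumask_cycle()-style helper.

/*
 * Standalone model of the bug and the fix (NOT Xen code: uint8_t bitmasks
 * stand in for cpumask_t and all names below are hypothetical).
 * CPUs 0-3 model NUMA node 0, CPUs 4-7 model NUMA node 1.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t mask_t;

/* Pick a set bit; mirrors cpumask_cycle()'s "mask must not be empty" rule. */
static int cycle(mask_t m)
{
    assert(m != 0);               /* this is the ASSERT that fired in the report */
    for (int c = 0; c < 8; c++)
        if (m & (1u << c))
            return c;
    return -1;                    /* unreachable */
}

/* Old-style check: node-affinity filtered only against the vcpu-affinity. */
static int has_node_affinity_old(mask_t node_aff, mask_t cpu_aff)
{
    return (node_aff & cpu_aff) != 0;
}

/* New-style check: the caller passes whatever mask the result will be ANDed with. */
static int has_node_affinity_new(mask_t node_aff, mask_t mask)
{
    return (node_aff & mask) != 0;
}

int main(void)
{
    mask_t node_aff = 0xF0;       /* domain's node-affinity: node 1 only        */
    mask_t cpu_aff  = 0xFF;       /* vcpu-affinity: all cpus                    */
    mask_t online   = 0x0F;       /* cpupool after cpupool-numa-split: node 0   */

    /* Fixed path: filter against (cpu_aff & online) and skip the step early. */
    if (has_node_affinity_new(node_aff, cpu_aff & online))
        printf("node-affinity step: cpu %d\n", cycle(node_aff & cpu_aff & online));
    else
        printf("node-affinity step skipped, falling back to vcpu-affinity\n");

    /* Buggy path: the old check passes, then cycle() asserts on an empty mask. */
    if (has_node_affinity_old(node_aff, cpu_aff))
        printf("cpu %d\n", cycle(node_aff & cpu_aff & online));  /* mask == 0 -> assert */

    return 0;
}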
George Dunlap
2013-Sep-16 14:00 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On 13/09/13 17:09, Dario Faggioli wrote:
> in _csched_cpu_pick(), as not doing so may result in the domain's
> node-affinity mask (as retrieved by csched_balance_cpumask() )
> and online mask (as retrieved by cpupool_scheduler_cpumask() )
> having an empty intersection.
>
> Therefore, when attempting a node-affinity load balancing step
> and running this:
>
>     ...
>     /* Pick an online CPU from the proper affinity mask */
>     csched_balance_cpumask(vc, balance_step, &cpus);
>     cpumask_and(&cpus, &cpus, online);
>     ...
>
> we end up with an empty cpumask (in cpus). At this point, in
> the following code:
>
>     ....
>     /* If present, prefer vc's current processor */
>     cpu = cpumask_test_cpu(vc->processor, &cpus)
>             ? vc->processor
>             : cpumask_cycle(vc->processor, &cpus);
>     ....
>
> an ASSERT (from inside cpumask_cycle() ) triggers like this:
>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
> (XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
> (XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
> (XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
> (XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
> (XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
> (XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
> (XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
> (XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481
>
> It is for example sufficient to have a domain with node-affinity
> to NUMA node 1 running, and issuing a `xl cpupool-numa-split'
> would make the above happen. That is because, by default, all
> the existing domains remain assigned to the first cpupool, and
> it now (after the cpupool-numa-split) only includes NUMA node 0.
>
> This change prevents that by generalizing the function used
> for figuring out whether a node-affinity load balancing step
> is legit or not. This way we can, in _csched_cpu_pick(),
> figure out early enough that the mask would end up empty,
> skip the step altogether and avoid the splat.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Keir Fraser <keir@xen.org>
> ---
> This was originally submitted (yesterday) as "xen: make sure the node-affinity
> is always updated". After discussing it with George, I got convinced that a fix
> like this is more appropriate, since it deals with the issue in the scheduler,
> without making a mess out of the user interface.
>
> This is therefore not being posted as a real v2 of that patch because the
> approach radically changed.
> ---
>  xen/common/sched_credit.c |   41 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index dbe6de6..d76cd88 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -296,15 +296,25 @@ static void csched_set_node_affinity(
>   * vcpu-affinity balancing is always necessary and must never be skipped.
>   * OTOH, if a domain's node-affinity is said to be automatically computed
>   * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely. Ah, node-affinity is also deemed meaningless
> - * in case it has empty intersection with the vcpu's vcpu-affinity, as it
> - * would mean trying to schedule it on _no_ pcpu!
> + * node-affinity entirely.
> + *
> + * Node-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the node-affinity
> + * mask seems legit, but would instead led to trying to schedule the vcpu
> + * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
> + * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> + * in the domain's cpupool.
>   */
> -#define __vcpu_has_node_affinity(vc) \
> -    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || !cpumask_intersects(vc->cpu_affinity, \
> -                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || vc->domain->auto_node_affinity == 1) )
> +static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
> +{
> +    if ( vc->domain->auto_node_affinity == 1
> +         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
> +         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
> +                                mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
>  /*
>   * Each csched-balance step uses its own cpumask. This function determines
> @@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>          int new_idlers_empty;
>
>          if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> -             && !__vcpu_has_node_affinity(new->vcpu) )
> +             && !__vcpu_has_node_affinity(new->vcpu,
> +                                          new->vcpu->cpu_affinity) )
>              continue;
>
>          /* Are there idlers suitable for new (for this balance step)? */
> @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>      int balance_step;
>
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, vc->cpu_affinity, online);
> +
>      for_each_csched_balance_step( balance_step )
>      {
> +        /*
> +         * We filter the node-affinity mask against
> +         * [vc->cpu_affinity && online] here to avoid problems if all
> +         * the cpus in in the node-affinity mask are offline (e.g.,
> +         * because the domain moved to a different cpupool).
> +         */

It's probably worth mentioning that the reason you do the cpumask_and
here and not in the other place this is called is that cpumask_cycle is
called after this one (which will choke on an empty mask), but not in the
other place it's called.

Other than that, looks good -- thanks.

 -George
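A schematic way to picture the asymmetry George describes (again with made-up helpers on plain bitmasks, not the actual credit scheduler code): a call site that goes on to cycle through the resulting mask has to rule out emptiness up front, whereas a call site that only tests the intersection can treat an empty result as "nothing to do" for that balance step.

/*
 * Schematic illustration only; the helper names and callers are hypothetical
 * stand-ins for the real code paths discussed in this thread.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t mask_t;

/* Like cpumask_cycle(): the caller must guarantee a non-empty mask. */
static int cycle(mask_t m)
{
    assert(m != 0);
    for (int c = 0; c < 8; c++)
        if (m & (1u << c))
            return c;
    return -1; /* unreachable */
}

/* Caller A (think _csched_cpu_pick): the per-step mask is cycled through,
 * so an empty intersection has to be ruled out before we get here. */
static int pick_cpu(mask_t balance_mask, mask_t online)
{
    return cycle(balance_mask & online);
}

/* Caller B (think __runq_tickle / csched_runq_steal): an empty intersection
 * is harmless, it simply means "nothing to do" for this balance step. */
static int step_has_work(mask_t balance_mask, mask_t idlers)
{
    return (balance_mask & idlers) != 0;
}

int main(void)
{
    mask_t node_aff = 0xF0, online = 0x0F, idlers = 0x0F;

    /* Empty intersection is fine for caller B: it just reports 0. */
    printf("caller B with empty intersection: %d (harmless)\n",
           step_has_work(node_aff, idlers));

    /* The same empty intersection trips the assertion in caller A. */
    printf("caller A: %d\n", pick_cpu(node_aff, online));
    return 0;
}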
Dario Faggioli
2013-Sep-16 15:23 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On lun, 2013-09-16 at 15:00 +0100, George Dunlap wrote:
> On 13/09/13 17:09, Dario Faggioli wrote:
>
> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > index dbe6de6..d76cd88 100644
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
>
> > @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
> >      int balance_step;
> >
> >      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > +    cpumask_and(&cpus, vc->cpu_affinity, online);
> > +
> >      for_each_csched_balance_step( balance_step )
> >      {
> > +        /*
> > +         * We filter the node-affinity mask against
> > +         * [vc->cpu_affinity && online] here to avoid problems if all
> > +         * the cpus in in the node-affinity mask are offline (e.g.,
> > +         * because the domain moved to a different cpupool).
> > +         */
>
> It's probably worth mentioning that the reason you do the cpumask_and
> here and not in the other place this is called is that cpumask_cycle is
> called after this one (which will choke on an empty mask), but not in the
> other place it's called.
>
Well, it is indeed what the comment says, or at least what I wanted it to
say. :-)

I tried to concentrate on the actual issue, which is avoiding picking an
offline cpu, independently of which specific ASSERT is the one that
triggers.

Anyway, I can try to make the comment more accurate and descriptive
though.

> Other than that, looks good -- thanks.
>
Ok, I'll resend with an updated comment.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2013-Sep-16 15:27 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On 16/09/13 16:23, Dario Faggioli wrote:
> On lun, 2013-09-16 at 15:00 +0100, George Dunlap wrote:
>> On 13/09/13 17:09, Dario Faggioli wrote:
>>
>>> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
>>> index dbe6de6..d76cd88 100644
>>> --- a/xen/common/sched_credit.c
>>> +++ b/xen/common/sched_credit.c
>>> @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>>>      int balance_step;
>>>
>>>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> +    cpumask_and(&cpus, vc->cpu_affinity, online);
>>> +
>>>      for_each_csched_balance_step( balance_step )
>>>      {
>>> +        /*
>>> +         * We filter the node-affinity mask against
>>> +         * [vc->cpu_affinity && online] here to avoid problems if all
>>> +         * the cpus in in the node-affinity mask are offline (e.g.,
>>> +         * because the domain moved to a different cpupool).
>>> +         */
>> It's probably worth mentioning that the reason you do the cpumask_and
>> here and not in the other place this is called is that cpumask_cycle is
>> called after this one (which will choke on an empty mask), but not in the
>> other place it's called.
>>
> Well, it is indeed what the comment says, or at least what I wanted it to
> say. :-)
>
> I tried to concentrate on the actual issue, which is avoiding picking an
> offline cpu, independently of which specific ASSERT is the one that
> triggers.

Ok, but the question is, why AND it with online cpus here, but not in
the other place? If the "actual issue" is that the mask may be empty,
it should be done in both places. If the "actual issue" is that
cpumask_cycle() chokes on empty masks, then that's an important thing to
point out, so poor programmers don't get confused as to why it's
different. :-)

 -George
Dario Faggioli
2013-Sep-17 15:14 UTC
Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On lun, 2013-09-16 at 16:27 +0100, George Dunlap wrote:
> On 16/09/13 16:23, Dario Faggioli wrote:
> > I tried to concentrate on the actual issue, which is avoiding picking an
> > offline cpu, independently of which specific ASSERT is the one that
> > triggers.
>
> Ok, but the question is, why AND it with online cpus here, but not in
> the other place? If the "actual issue" is that the mask may be empty,
> it should be done in both places. If the "actual issue" is that
> cpumask_cycle() chokes on empty masks, then that's an important thing to
> point out, so poor programmers don't get confused as to why it's
> different. :-)
>
It's a little bit of both. In fact, _csched_cpu_pick() *is* the only
function structured in such a way that the mask can't be empty (if we
leave node-affinity out). The other ones either don't care or account
for it in different ways and places, and that was like that even before
my NUMA-aware scheduling patches... So, yes, that one call site is, if
not special, at least different from the other ones.

At the same time, you are right in saying that the consequence of the
above that is really bothersome and calls for this fix _is_ indeed that
the ASSERT() fails.

So, I'm re-sending the patch with an updated comment which hopefully
clarifies the situation a bit more. Let me know what you think about it
(e-mail coming right after this one).

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)