The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
incompatible with the SEDF scheduler: Any vCPU using
VCPUOP_stop_periodic_timer (e.g. any vCPU of halfway modern PV Linux
guests) ends up on pCPU0 after that call. Obviously, running all PV
guests' (and namely Dom0's) vCPU-s on pCPU0 causes problems for those
guests rather sooner than later.

So the main thing that was clearly wrong (and bogus from the beginning)
was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced
by a construct that prefers to put back the vCPU on the pCPU that it
got launched on.

However, there's one more glitch: When reducing the affinity of a vCPU
temporarily, and then widening it again to a set that includes the pCPU
that the vCPU was last running on, the generic scheduler code would not
force a migration of that vCPU, and hence it would forever stay on the
pCPU it last ran on. Since that can again create a load imbalance, the
SEDF scheduler wants a migration to happen regardless of it being
apparently unnecessary.

Of course, an alternative to checking for SEDF explicitly in
vcpu_set_affinity() would be to introduce a flags field in struct
scheduler, and have SEDF set an "always-migrate-on-affinity-change"
flag.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/sched_sedf.c
+++ b/xen/common/sched_sedf.c
@@ -397,7 +397,8 @@ static int sedf_pick_cpu(const struct sc

     online = cpupool_scheduler_cpumask(v->domain->cpupool);
     cpumask_and(&online_affinity, v->cpu_affinity, online);
-    return cpumask_first(&online_affinity);
+    return cpumask_cycle(v->vcpu_id % cpumask_weight(&online_affinity) - 1,
+                         &online_affinity);
 }

 /*
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -605,7 +605,8 @@ int vcpu_set_affinity(struct vcpu *v, co
     vcpu_schedule_lock_irq(v);

     cpumask_copy(v->cpu_affinity, affinity);
-    if ( !cpumask_test_cpu(v->processor, v->cpu_affinity) )
+    if ( VCPU2OP(v)->sched_id == XEN_SCHEDULER_SEDF ||
+         !cpumask_test_cpu(v->processor, v->cpu_affinity) )
        set_bit(_VPF_migrating, &v->pause_flags);

     vcpu_schedule_unlock_irq(v);
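For illustration of what the new return statement buys: each vCPU is now
steered towards a pCPU derived from its vCPU ID, rather than always towards
the first allowed pCPU. Below is a minimal standalone sketch of that
placement rule; it is only a model, using plain 64-bit bitmaps and
hand-rolled helpers in place of Xen's struct cpumask, cpumask_weight() and
cpumask_cycle(), not the hypervisor code itself:

#include <stdint.h>
#include <stdio.h>

/* Rough stand-in for Xen's cpumask_cycle(): first set bit after
 * position n, wrapping around to the overall first set bit. */
static int next_cpu_cycle(int n, uint64_t mask)
{
    for ( int cpu = n + 1; cpu < 64; cpu++ )
        if ( mask & (1ULL << cpu) )
            return cpu;
    for ( int cpu = 0; cpu < 64; cpu++ )
        if ( mask & (1ULL << cpu) )
            return cpu;
    return -1; /* empty mask */
}

/* Model of the new sedf_pick_cpu() placement:
 * cpumask_cycle(vcpu_id % cpumask_weight(mask) - 1, mask). */
static int pick_cpu(unsigned int vcpu_id, uint64_t online_affinity)
{
    unsigned int weight = __builtin_popcountll(online_affinity);

    return next_cpu_cycle((int)(vcpu_id % weight) - 1, online_affinity);
}

int main(void)
{
    /* Four usable pCPUs (0-3): vCPUs no longer all land on pCPU0. */
    for ( unsigned int v = 0; v < 6; v++ )
        printf("vCPU%u -> pCPU%d\n", v, pick_cpu(v, 0xfULL));
    return 0;
}

With four usable pCPUs this prints vCPU0 -> pCPU0, vCPU1 -> pCPU1,
vCPU2 -> pCPU2, vCPU3 -> pCPU3, vCPU4 -> pCPU0, and so on, i.e. the
bunching on pCPU0 that cpumask_first() produced is gone.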
On 01/03/2013 15:35, "Jan Beulich" <JBeulich@suse.com> wrote:

> The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
> incompatible with the SEDF scheduler: [...]
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Keir Fraser <keir@xen.org>
On Fri, 2013-03-01 at 15:35 +0000, Jan Beulich wrote:
> The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
> incompatible with the SEDF scheduler: [...]
>
> So the main thing that was clearly wrong (and bogus from the beginning)
> was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced
> by a construct that prefers to put back the vCPU on the pCPU that it
> got launched on.

Is this change at all related to the long-standing failure to boot with
sedf in the test harness?
On 01.03.2013 16:35, Jan Beulich wrote:
> [...]
>
> Of course, an alternative to checking for SEDF explicitly in
> vcpu_set_affinity() would be to introduce a flags field in struct
> scheduler, and have SEDF set an "always-migrate-on-affinity-change"
> flag.

Or something like this? I don't like the test for sedf in schedule.c

diff -r 65105a4a8c7a xen/common/sched_sedf.c
--- a/xen/common/sched_sedf.c   Fri Mar 01 16:59:49 2013 +0100
+++ b/xen/common/sched_sedf.c   Mon Mar 04 07:35:53 2013 +0100
@@ -397,7 +397,8 @@ static int sedf_pick_cpu(const struct sc

     online = cpupool_scheduler_cpumask(v->domain->cpupool);
     cpumask_and(&online_affinity, v->cpu_affinity, online);
-    return cpumask_first(&online_affinity);
+    return cpumask_cycle(v->vcpu_id % cpumask_weight(&online_affinity) - 1,
+                         &online_affinity);
 }

 /*
@@ -1503,6 +1504,11 @@ out:
     return rc;
 }

+static void sedf_set_affinity(const struct scheduler *ops, struct vcpu *v)
+{
+    set_bit(_VPF_migrating, &v->pause_flags);
+}
+
 static struct sedf_priv_info _sedf_priv;

 const struct scheduler sched_sedf_def = {
@@ -1532,6 +1538,7 @@ const struct scheduler sched_sedf_def
     .sleep          = sedf_sleep,
     .wake           = sedf_wake,
     .adjust         = sedf_adjust,
+    .set_affinity   = sedf_set_affinity,
 };

 /*
diff -r 65105a4a8c7a xen/common/schedule.c
--- a/xen/common/schedule.c     Fri Mar 01 16:59:49 2013 +0100
+++ b/xen/common/schedule.c     Mon Mar 04 07:35:53 2013 +0100
@@ -615,6 +615,8 @@ int vcpu_set_affinity(struct vcpu *v, co
     cpumask_copy(v->cpu_affinity, affinity);
     if ( !cpumask_test_cpu(v->processor, v->cpu_affinity) )
         set_bit(_VPF_migrating, &v->pause_flags);
+    if ( VCPU2OP(v)->set_affinity )
+        SCHED_OP(VCPU2OP(v), set_affinity, v);

     vcpu_schedule_unlock_irq(v);

diff -r 65105a4a8c7a xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h        Fri Mar 01 16:59:49 2013 +0100
+++ b/xen/include/xen/sched-if.h        Mon Mar 04 07:35:53 2013 +0100
@@ -180,6 +180,7 @@ struct scheduler {
     int          (*pick_cpu)       (const struct scheduler *, struct vcpu *);
     void         (*migrate)        (const struct scheduler *, struct vcpu *,
                                     unsigned int);
+    void         (*set_affinity)   (const struct scheduler *, struct vcpu *);
     int          (*adjust)         (const struct scheduler *, struct domain *,
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,

--
Juergen Gross                 Principal Developer Operating Systems
Fujitsu Technology Solutions           juergen.gross@ts.fujitsu.com
>>> On 02.03.13 at 04:44, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Fri, 2013-03-01 at 15:35 +0000, Jan Beulich wrote:
>> [...]
>
> Is this change at all related to the long-standing failure to boot with
> sedf in the test harness?

Yes, very likely. It was looking at the logs (where all Dom0 vCPU-s
were similarly running on pCPU0), plus immediately seeing a hang when
trying it out, that prompted me to finally look into the issue.

Jan
>>> On 04.03.13 at 07:48, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> On 01.03.2013 16:35, Jan Beulich wrote:
>> [...]
>
> Or something like this? I don't like the test for sedf in schedule.c

As said - to me a flag would seem more appropriate than another
indirect call. But for now I'll commit as is - feel free to submit an
incremental patch.

Jan
On Fri, Mar 1, 2013 at 3:35 PM, Jan Beulich <JBeulich@suse.com> wrote:
> [...]
>
> However, there's one more glitch: When reducing the affinity of a vCPU
> temporarily, and then widening it again to a set that includes the pCPU
> that the vCPU was last running on, the generic scheduler code would not
> force a migration of that vCPU, and hence it would forever stay on the
> pCPU it last ran on. Since that can again create a load imbalance, the
> SEDF scheduler wants a migration to happen regardless of it being
> apparently unnecessary.

I'm not quite understanding what this is about -- why is this
necessary for sedf but not for the other schedulers?

 -George
>>> On 04.03.13 at 12:22, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Fri, Mar 1, 2013 at 3:35 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> [...]
>
> I'm not quite understanding what this is about -- why is this
> necessary for sedf but not for the other schedulers?

The problem with sedf is that it doesn't do any balancing, and never
moves a vCPU to a different pCPU unless the affinity mask changes in a
way that makes this necessary (in which case it's the generic scheduler
code that invokes the relocation).

So when a vCPU narrows its affinity (in the extreme to a mask with just
one bit set) and then widens it again, it would nevertheless remain on
the pCPU it was formerly restricted to. When (perhaps much later) a
second and then a third vCPU do the same, they may all end up running
on the same pCPU, i.e. we'd get back to the problem addressed by the
first half of the fix here.

While you might argue that affinity changes ought to be done in a
"sensible" way, I think you still realize that there is behavior here
that one wouldn't be able to control without explicitly adding an
intermediate step when doing this for DomU-s. And you should also keep
in mind that there are certain things that Dom0 needs to change its
vCPU affinities for (pv-ops doesn't do that, which is why certain
things there just can't work); this is how I noticed in the first place
that the second half of the fix is necessary.

Jan
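To make the narrow-then-widen scenario concrete, here is a tiny
self-contained model of the pre-patch vcpu_set_affinity() check; it is an
illustration only, with bitmaps standing in for cpumasks, and its "move to
the lowest allowed pCPU" step is a deliberate simplification of the
interplay between _VPF_migrating and the old cpumask_first()-based
sedf_pick_cpu():

#include <stdio.h>

struct toy_vcpu {
    unsigned int processor;      /* pCPU the vCPU currently runs on */
    unsigned long cpu_affinity;  /* allowed pCPUs, as a bitmap      */
};

/* Model of the pre-patch check in vcpu_set_affinity(): only force a
 * migration when the current pCPU drops out of the new affinity mask. */
static void set_affinity(struct toy_vcpu *v, unsigned long affinity)
{
    v->cpu_affinity = affinity;
    if ( !(affinity & (1UL << v->processor)) )
    {
        /* Simplified stand-in for _VPF_migrating plus the old
         * sedf_pick_cpu(): move to the lowest-numbered allowed pCPU. */
        v->processor = (unsigned int)__builtin_ctzl(affinity);
        printf("migrated to pCPU%u\n", v->processor);
    }
    else
        printf("stays on pCPU%u\n", v->processor);
}

int main(void)
{
    struct toy_vcpu v = { .processor = 2, .cpu_affinity = 0xf };

    set_affinity(&v, 1UL << 0); /* narrow to pCPU0 only   -> forced move  */
    set_affinity(&v, 0xf);      /* widen to pCPU0-3 again -> no move, the */
                                /* vCPU is now stuck on pCPU0             */
    return 0;
}

Narrowing the mask to just pCPU0 forces a move there, but widening it again
does not force a move back, so every vCPU that ever goes through such a
narrow phase can end up accumulated on pCPU0 - which is the imbalance
described above, and why the patch forces a re-evaluation for SEDF on every
affinity change.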
On 04/03/13 11:37, Jan Beulich wrote:
> [...] And you should also keep in mind that there are certain things
> that Dom0 needs to change its vCPU affinities for (pv-ops doesn't do
> that, which is why certain things there just can't work); this is how
> I noticed in the first place that the second half of the fix is
> necessary.

Oh right -- so the problem is that when you *expand* the mask, you want
the scheduler to take another look at where might be a good place to
run the vcpu given the changed constraints.

It's probably not a bad idea to just extend that idea to all the
schedulers really. It's not like we expect people to be changing the
affinity masks hundreds of times a second.

 -George
On 04/03/13 12:23, Jan Beulich wrote:
>>>> On 04.03.13 at 13:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> It's probably not a bad idea to just extend that idea to all the
>> schedulers really. It's not like we expect people to be changing the
>> affinity masks hundreds of times a second.
>
> Yeah, why not - would (minimally) reduce code size and remove the
> SEDF-specific check from the generic scheduler code again.

I'll whip up a patch...

 -G
>>> On 04.03.13 at 13:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> It's probably not a bad idea to just extend that idea to all the
> schedulers really. It's not like we expect people to be changing the
> affinity masks hundreds of times a second.

Yeah, why not - would (minimally) reduce code size and remove the
SEDF-specific check from the generic scheduler code again.

Jan
Ian Jackson, 2013-Mar-05 15:54 UTC, Re: [PATCH] SEDF: avoid gathering vCPU-s on pCPU0 [and 1 more messages]:
Jan Beulich writes ("Re: [Xen-devel] [PATCH] SEDF: avoid gathering vCPU-s on pCPU0"):
> On 02.03.13 at 04:44, Ian Campbell <ian.campbell@citrix.com> wrote:
> > Is this change at all related to the long-standing failure to boot
> > with sedf in the test harness?
>
> Yes, very likely. It was looking at the logs (where all Dom0 vCPU-s
> were similarly running on pCPU0), plus immediately seeing a hang when
> trying it out, that prompted me to finally look into the issue.

Good, but...

Jan Beulich writes ("Re: [Xen-devel] [PATCH] SEDF: avoid gathering vCPU-s on pCPU0"):
...
> So when a vCPU narrows its affinity (in the extreme to a mask with
> just one bit set) and then widens it again, it would nevertheless
> remain on the pCPU it was formerly restricted to. When (perhaps much
> later) a second and then a third vCPU do the same, they may all end
> up running on the same pCPU, i.e. we'd get back to the problem
> addressed by the first half of the fix here.

I don't understand why having every VCPU in the whole system running
on a single PCPU would result in a complete hang. Surely unless we
have many VCPUs all piling on spinlocks there will not be that many
runnable VCPUs?

Or is this symptom due to the lack of PV spinlocks?

Ian.
Jan Beulich, 2013-Mar-05 16:11 UTC, Re: [PATCH] SEDF: avoid gathering vCPU-s on pCPU0 [and 1 more messages]:
>>> On 05.03.13 at 16:54, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] [PATCH] SEDF: avoid gathering vCPU-s on pCPU0"):
>> [...]
>
> I don't understand why having every VCPU in the whole system running
> on a single PCPU would result in a complete hang. Surely unless we
> have many VCPUs all piling on spinlocks there will not be that many
> runnable VCPUs?
>
> Or is this symptom due to the lack of PV spinlocks?

The case that I observed here was all CPUs piling up in stop_machine
logic - that's effectively an open-coded spinning operation. But
clearly non-ticket, non-pv spinlocks can result in similar livelock
effects (as soon as the locking rate is higher than what the single
CPU can handle in terms of executing locked-region code).

Jan
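For readers wondering what "open-coded spinning" looks like in this
context, here is a rough user-space illustration (an assumption-laden
sketch, not the actual Linux stop_machine code): every participant
announces its arrival and then busy-waits for all the others, with no
yield and no back-off. Run as ordinary threads this finishes instantly;
the point is that when all the participating vCPUs share one pCPU and do
not yield to the hypervisor while spinning, each waiter burns its whole
scheduling slice before the next participant can even arrive, which is
how the apparent hang above comes about.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

static atomic_int arrived;

/* Stop_machine-style rendezvous shape: arrive, then spin until everyone
 * else has arrived too. No yield, no back-off - an open-coded spin. */
static void *rendezvous(void *arg)
{
    atomic_fetch_add(&arrived, 1);
    while ( atomic_load(&arrived) < NR_CPUS )
        ; /* busy-wait */
    printf("participant %ld made it through the rendezvous\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[NR_CPUS];

    for ( long i = 0; i < NR_CPUS; i++ )
        pthread_create(&t[i], NULL, rendezvous, (void *)i);
    for ( long i = 0; i < NR_CPUS; i++ )
        pthread_join(t[i], NULL);
    return 0;
}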