As one of the topics presented at Xen Summit 2011 in Santa Clara, we proposed a scheduler rate controller (SRC) to limit excessively frequent scheduling under certain conditions. You can find the slides at
http://www.slideshare.net/xen_com_mgr/9-hui-lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystems

We have tested it on a 2-socket multi-core system over many rounds and got positive results: it improves performance greatly, both with the consolidation workload SPECvirt_sc2010 and with smaller workloads such as sysbench and SPECjbb. So I am posting it here for review.

In the Xen scheduling mechanism, the hypervisor kicks the related VCPUs by raising the schedule softirq while processing external interrupts. Therefore, if the number of IRQs is very large, scheduling happens more frequently. Frequent scheduling
1) brings more overhead for the hypervisor and
2) increases the cache miss rate.

In our consolidation workload, SPECvirt_sc2010, SR-IOV and an iSCSI solution are adopted to bypass software emulation, but they bring heavy network traffic. Correspondingly, about 15k schedules happened per second on each physical core, which means the average running time is very short -- only 60us. We propose SRC in Xen to mitigate this problem. The performance benefit brought by this patch is very large at peak throughput, with no influence when system load is low.

SRC improved SPECvirt performance by 14%.
1) It reduced CPU utilization, which allows more load to be added.
2) Response time (QoS) became better at the same CPU %.
3) The better response time allowed us to push the CPU % at peak performance to an even higher level (the CPU was not saturated in SPECvirt).
SRC reduced the context switch rate significantly, which resulted in
1) a smaller path length,
2) fewer cache misses and thus a lower CPI, and
3) better performance on both the guest and hypervisor sides.

With this patch, from our SPECvirt_sc2010 results, the performance of Xen catches up with the other open-source hypervisor.

Signed-off-by: Hui Lv <hui.lv@intel.com>

diff -ruNp xen.org/common/schedule.c xen/common/schedule.c
--- xen.org/common/schedule.c	2011-10-20 03:29:44.000000000 -0400
+++ xen/common/schedule.c	2011-10-23 21:41:14.000000000 -0400
@@ -98,6 +98,31 @@ static inline void trace_runstate_change
     __trace_var(event, 1/*tsc*/, sizeof(d), &d);
 }
 
+/*
+ * opt_sched_rate_control: parameter to turn the scheduler rate controller (SRC) on/off.
+ * opt_sched_rate_high: scheduling frequency threshold; the default value is 50.
+ *
+ * We suggest setting opt_sched_rate_high to a value larger than 50.
+ * If the number of schedules counted during SCHED_SRC_INTERVAL (default 10
+ * milliseconds) is larger than opt_sched_rate_high, SRC kicks in.
+ */
+bool_t opt_sched_rate_control = 0;
+unsigned int opt_sched_rate_high = 50;
+boolean_param("sched_rate_control", opt_sched_rate_control);
+integer_param("sched_rate_high", opt_sched_rate_high);
+
+/*
+ * src_controller() is the scheduling rate controller (SRC). It is triggered
+ * when the scheduling frequency is excessively high (larger than opt_sched_rate_high).
+ *
+ * Rules to control the scheduling frequency:
+ * 1) If the number of schedules (sd->s_csnum), counted over a period of
+ *    SCHED_SRC_INTERVAL, is larger than the threshold opt_sched_rate_high,
+ *    SRC is enabled by setting sd->s_src_control = 1.
+ * 2) While SRC is enabled, the previous vcpu is returned directly if it is
+ *    still runnable and is not the idle vcpu. This decreases the scheduling
+ *    frequency when it is excessive.
+ */
+void src_controller(struct schedule_data *sd, struct vcpu *prev, s_time_t now);
+
 static inline void trace_continue_running(struct vcpu *v)
 {
@@ -1033,6 +1058,29 @@ static void vcpu_periodic_timer_work(str
     set_timer(&v->periodic_timer, periodic_next_event);
 }
 
+void src_controller(struct schedule_data *sd, struct vcpu *prev, s_time_t now)
+{
+    sd->s_csnum++;
+    if ( (now - sd->s_src_loop_begin) >= MILLISECS(SCHED_SRC_INTERVAL) )
+    {
+        if ( sd->s_csnum >= opt_sched_rate_high )
+            sd->s_src_control = 1;
+        else
+            sd->s_src_control = 0;
+        sd->s_src_loop_begin = now;
+        sd->s_csnum = 0;
+    }
+    if ( sd->s_src_control )
+    {
+        if ( !is_idle_vcpu(prev) && vcpu_runnable(prev) )
+        {
+            perfc_incr(sched_src);
+            return continue_running(prev);
+        }
+        perfc_incr(sched_nosrc);
+    }
+}
+
 /*
  * The main function
  * - deschedule the current domain (scheduler independent).
@@ -1054,6 +1102,8 @@ static void schedule(void)
 
     sd = &this_cpu(schedule_data);
 
+    if ( opt_sched_rate_control )
+        src_controller(sd, prev, now);
     /* Update tasklet scheduling status. */
     switch ( *tasklet_work )
     {
@@ -1197,6 +1247,9 @@ static int cpu_schedule_up(unsigned int
     sd->curr = idle_vcpu[cpu];
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&sd->urgent_count, 0);
+    sd->s_csnum = 0;
+    sd->s_src_loop_begin = NOW();
+    sd->s_src_control = 0;
 
     /* Boot CPU is dealt with later in schedule_init(). */
     if ( cpu == 0 )
diff -ruNp xen.org/include/xen/perfc_defn.h xen/include/xen/perfc_defn.h
--- xen.org/include/xen/perfc_defn.h	2011-10-20 03:29:44.000000000 -0400
+++ xen/include/xen/perfc_defn.h	2011-10-23 21:08:28.000000000 -0400
@@ -15,6 +15,8 @@ PERFCOUNTER(ipis,                   "#IP
 PERFCOUNTER(sched_irq,              "sched: timer")
 PERFCOUNTER(sched_run,              "sched: runs through scheduler")
 PERFCOUNTER(sched_ctx,              "sched: context switches")
+PERFCOUNTER(sched_src,              "sched: src triggered")
+PERFCOUNTER(sched_nosrc,            "sched: src not triggered")
 
 PERFCOUNTER(vcpu_check,             "csched: vcpu_check")
 PERFCOUNTER(schedule,               "csched: schedule")
diff -ruNp xen.org/include/xen/sched-if.h xen/include/xen/sched-if.h
--- xen.org/include/xen/sched-if.h	2011-10-20 03:29:44.000000000 -0400
+++ xen/include/xen/sched-if.h	2011-10-23 21:20:57.000000000 -0400
@@ -15,6 +15,11 @@ extern struct cpupool *cpupool0;
 /* cpus currently in no cpupool */
 extern cpumask_t cpupool_free_cpus;
 
+/* SRC decides whether to trigger the scheduling controller by comparing the
+ * scheduling frequency, counted during SCHED_SRC_INTERVAL, against the
+ * threshold opt_sched_rate_high. The suggested SCHED_SRC_INTERVAL is 10 (ms).
+ */
+#define SCHED_SRC_INTERVAL 10
 
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
@@ -32,6 +37,9 @@ struct schedule_data {
     struct vcpu        *curr;             /* current task */
     void               *sched_priv;
     struct timer        s_timer;          /* scheduling timer */
+    int                 s_csnum;          /* number of schedules in the last period */
+    s_time_t            s_src_loop_begin; /* SRC counting start point */
+    bool_t              s_src_control;    /* whether SRC should be triggered */
     atomic_t            urgent_count;     /* how many urgent vcpus */
 };
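For anyone who wants to try it, the two knobs registered above would be enabled from the Xen boot line roughly as follows. This is only an illustrative sketch: the threshold of 100 is an arbitrary example value, and the exact bootloader stanza will differ per installation.

    kernel /boot/xen.gz sched_rate_control=1 sched_rate_high=100

With a perf-counter-enabled build, the new sched_src / sched_nosrc counters can then be inspected (for example via the xenperf utility) to see how often the controller actually fires.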
On Mon, Oct 24, 2011 at 4:36 AM, Lv, Hui <hui.lv@intel.com> wrote:
> As one of the topics presented at Xen Summit 2011 in Santa Clara, we proposed a scheduler rate controller (SRC) to limit excessively frequent scheduling under certain conditions.
> [...]
> With this patch, from our SPECvirt_sc2010 results, the performance of Xen catches up with the other open-source hypervisor.

Hui,

Thanks for the patch, and the work you've done testing it. There are a couple of things to discuss.

* I'm not sure I like the idea of doing this at the generic level rather than at the specific scheduler level -- e.g., inside of credit1. For better or for worse, all aspects of scheduling work together, and even small changes tend to have a significant effect on the emergent behavior. I understand why you'd want this in the generic scheduling code; but it seems like it would be better for each scheduler to implement a rate control independently.

* The actual algorithm you use here isn't described. It seems to be as follows (please correct me if I've made a mistake reverse-engineering the algorithm): Every 10ms, check to see if there have been more than 50 schedules. If so, disable pre-emption entirely for 10ms, allowing processes to run without being interrupted (unless they yield).

It seems like we should be able to do better. For one, it means in the general case you will flip back and forth between really frequent schedules and less frequent schedules. For two, turning off preemption entirely will mean that whatever vcpu happens to be running could, if it wished, run for the full 10ms; and which one got elected to do that would be really random. This may work well for SPECvirt, but it's the kind of algorithm that is likely to have some workloads on which it works very poorly.

Finally, there's the chance that this algorithm could be "gamed" -- i.e., if a rogue VM knew that most other VMs yielded frequently, it might be able to arrange that there would always be more than 50 context switches per 10ms interval, while it runs without preemption and takes up more than its fair share.

Have you tried just making it give each vcpu a minimum amount of scheduling time, say, 500us or 1ms?

Now a couple of stylistic comments:
* src tends to make me think of "source". I think sched_rate[_*] would fit the existing naming convention better.
* src_controller() shouldn't call continue_running() directly. Instead, schedule() should call src_controller(); and only call sched->do_schedule() if src_controller() returns false (or something like that).
* Whatever the algorithm is should have comments describing what it does and how it's supposed to work.
* Your patch is malformed; you need to have it apply at the top level, not from within the xen/ subdirectory. The easiest way to get a patch is to use either mercurial queues, or "hg diff". There are some good suggestions for making and posting patches here: http://wiki.xensource.com/xenwiki/SubmittingXenPatches

Thanks again for all your work on this -- we definitely want Xen to beat the other open-source hypervisor. :-)

 -George
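For what it's worth, the structural change suggested above (a rate-control helper acting as a pure predicate, with schedule() deciding what to do) might look roughly like the sketch below. The helper name, the 1ms constant and the use of runstate.state_entry_time are illustrative assumptions rather than anything posted in this thread, and locking and tracing are omitted.

    /* Return 1 if prev should simply keep running, 0 if a full schedule is needed. */
    static bool_t sched_rate_limited(struct vcpu *prev, s_time_t now)
    {
        s_time_t ran = now - prev->runstate.state_entry_time;  /* how long prev has run */

        if ( is_idle_vcpu(prev) || !vcpu_runnable(prev) )
            return 0;

        return ran < MICROSECS(1000);  /* illustrative 1ms minimum timeslice */
    }

    /* In schedule(), before asking the per-scheduler do_schedule() for a decision: */
    if ( opt_sched_rate_control && sched_rate_limited(prev, now) )
    {
        set_timer(&sd->s_timer, now + MICROSECS(1000));  /* revisit once the minimum elapses */
        return continue_running(prev);
    }
    next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);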
On 24/10/2011 17:17, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> * I'm not sure I like the idea of doing this at the generic level rather
> than at the specific scheduler level -- e.g., inside of credit1. For
> better or for worse, all aspects of scheduling work together, and even
> small changes tend to have a significant effect on the emergent
> behavior. I understand why you'd want this in the generic scheduling
> code; but it seems like it would be better for each scheduler to
> implement a rate control independently.

Yes, this doesn't belong in schedule.c.

 -- Keir
> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George Dunlap
> Sent: Tuesday, October 25, 2011 12:17 AM
> To: Lv, Hui
> Cc: xen-devel@lists.xensource.com; Duan, Jiangang; Tian, Kevin; keir@xen.org; Dong, Eddie
> Subject: Re: [Xen-devel] [PATCH] scheduler rate controller
>
> [...]
>
> * The actual algorithm you use here isn't described. It seems to be
> as follows (please correct me if I've made a mistake
> reverse-engineering the algorithm):
>
> Every 10ms, check to see if there have been more than 50 schedules.
> If so, disable pre-emption entirely for 10ms, allowing processes to
> run without being interrupted (unless they yield).

Sorry for the lack of description. You are right about the control process.

> It seems like we should be able to do better. For one, it means in
> the general case you will flip back and forth between really frequent
> schedules and less frequent schedules. For two, turning off
> preemption entirely will mean that whatever vcpu happens to be running
> could, if it wished, run for the full 10ms; and which one got elected
> to do that would be really random. This may work well for SPECvirt,
> but it's the kind of algorithm that is likely to have some workloads
> on which it works very poorly. Finally, there's the chance that this
> algorithm could be "gamed" -- i.e., if a rogue VM knew that most other
> VMs yielded frequently, it might be able to arrange that there would
> always be more than 50 context switches per 10ms interval, while it runs
> without preemption and takes up more than its fair share.

Yes, I agree there is more to do to make a more refined solution in the next step. For example, we could consider per-VM status when deciding whether to turn the control on or off, to make it fairer -- as in your third point. However, as a first step, the current solution is straightforward and effective.

1) Most importantly, it only kicks in when the scheduling frequency is excessive. The user can decide what counts as excessive by setting the parameter "opt_sched_rate_high" (default 50). If the system is crucial for latency-sensitive tasks, you can choose a higher value so that this patch has little impact on them; users can decide which value is good for their environment. In our experience, though, when the scheduling frequency is excessive it also impairs the QoS of latency-sensitive tasks, because they are frequently interrupted by other VMs.

2) Under the excessive-scheduling condition, preemption is turned off entirely. If the currently running vcpu yields frequently, it cannot run for the full 10ms. If it does not yield frequently, it can possibly run for up to 10ms. That means this algorithm roughly protects a non-yielding vcpu so that it can run a long time slice without preemption. This is something similar to your third point, but in a rough way. :)

3) Finally, this patch aims to solve the problem when the scheduling frequency is excessive, without influencing the normal (lower-frequency) case. We should treat these two cases separately, since the excessive-scheduling case can guarantee neither performance nor QoS.

> Have you tried just making it give each vcpu a minimum amount of
> scheduling time, say, 500us or 1ms?
>
> Now a couple of stylistic comments:
> * src tends to make me think of "source". I think sched_rate[_*]
> would fit the existing naming convention better.
> * src_controller() shouldn't call continue_running() directly.
> Instead, schedule() should call src_controller(); and only call
> sched->do_schedule() if src_controller() returns false (or something
> like that).
> * Whatever the algorithm is should have comments describing what it
> does and how it's supposed to work.
> * Your patch is malformed; you need to have it apply at the top level,
> not from within the xen/ subdirectory. The easiest way to get a patch
> is to use either mercurial queues, or "hg diff". There are some good
> suggestions for making and posting patches here:
> http://wiki.xensource.com/xenwiki/SubmittingXenPatches

Thanks for the kind information. I think the next version will be better. :)

> Thanks again for all your work on this -- we definitely want Xen to
> beat the other open-source hypervisor. :-)
>
> -George
Thanks, Dario, for your helpful comments.

> Just something crossed my mind reading the patch and the comments:
> would it make sense to rate-limit the calls coming from (non-timer)
> interrupt exit paths while still letting the tick trigger a
> scheduling decision? This just to be sure that at least the time slice
> enforcing (if any) happens as expected... Could it make sense?

Yes, it makes sense. But currently we lack that knowledge in the scheduler -- what caused the schedule, a timer or an interrupt. Can we get at that?

> More generally speaking, I see how this feature can be useful, and I
> also think it could live in the generic schedule.c code, but (as George
> was saying) the algorithm by which rate-limiting is happening needs to
> be well known, documented and exposed to the user (more than by means
> of a couple of perf-counters).

One question: what is the right place to document such information? I'd like to make it as clear as possible to users.

> For example this might completely destroy the time guarantees a
> scheduler like sEDF would give, and in such case it must be easy
> enough to figure out what's going on and why the scheduler is not
> behaving as expected!
>
> For that reason, although again, this could be made general enough to
> be sensible and meaningful for all the various schedulers, it might be
> worthwhile to have it inside credit1 for now, where we know it will
> probably yield the most of its benefits.

I think I get your point. More care should be taken to avoid disasters for any of the existing schedulers. I'm fine with moving it into credit1 at the current stage. :)

> Just my 2 cents. :-)
>
> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> ----------------------------------------------------------------------
> Dario Faggioli, http://retis.sssup.it/people/faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
> Not sure yet, I can imagine it's tricky and I need to dig a bit more in
> the code, but I'll let you know if I find a way of doing that...

There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised. But I think we want to focus on the scheduler itself raising it as a result of the .wake() callback. Whether the .wake() happens as a result of a HW interrupt or something else, I don't think really matters.

Dario and Hui, neither of you have commented on my idea, which is simply: don't preempt a VM if it has run for less than some amount of time (say, 500us or 1ms). If a higher-priority VM is woken up, see how long the current VM has run. If it's less than 1ms, set a 1ms timer and call schedule() then.

> > > [...]
> >
> > One question: what is the right place to document such information? I'd like to make it
> > as clear as possible to users.
> >
> Well, I don't know; maybe a WARN (a WARN_ONCE-alike thing would probably
> be better), or in general something that leaves a footstep in the logs,
> so that one can find out by means of `xl dmesg' or related. Obviously,
> I'm not suggesting printk-ing each suppressed schedule invocation, or
> the overhead would get even worse... :-P
>
> I'm thinking of something that happens the very first time the limiting
> fires, or maybe once per some period/number of suppressions, just to remind
> the user that he's getting weird behaviour because _he_enabled_
> rate-limiting. Hopefully, that might also be useful for the user itself
> to fine tune the limiting parameters, although I think the perf-counters
> are already quite well suited for this.

As much as possible, we want the system to Just Work. Under normal circumstances it wouldn't be too unusual for a VM to have a several-ms delay between receiving a physical interrupt and being scheduled; I think that if the 1ms delay works, having it on all the time would probably be the best solution. That's another reason I'm in favor of trying it -- it's simple and easy to understand, and doesn't require detecting when to "turn it on".

 -George
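As a rough illustration of the wake-side idea above (not code from any posted patch): when a wakeup would normally preempt the running vcpu, the scheduler could check how long that vcpu has run and, if it is under the minimum, arm the per-cpu scheduling timer instead of raising the softirq immediately. The prev_start field below is a hypothetical addition; cpu_raise_softirq(), set_timer() and per_cpu(schedule_data, cpu) are existing Xen interfaces.

    /* Sketch only: preempt now, or defer the preemption until 1ms has elapsed. */
    static void tickle_cpu_ratelimited(unsigned int cpu, s_time_t now)
    {
        /* 'prev_start' = when the currently running vcpu went in (hypothetical field) */
        s_time_t ran = now - per_cpu(schedule_data, cpu).prev_start;

        if ( ran >= MICROSECS(1000) )
            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);        /* preempt immediately */
        else
            set_timer(&per_cpu(schedule_data, cpu).s_timer,  /* preempt once 1ms is up */
                      now + (MICROSECS(1000) - ran));
    }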
I have tried an approach very similar to your idea:
1) Check whether the currently running vcpu has run for less than 1ms; if so, return the current vcpu directly without preemption.
It tries to guarantee that a vcpu can run for as long as 1ms if it wants to. It can reduce the scheduling frequency to some degree, but not very significantly, because 1ms is too weak compared with the 10ms window the SRC patch uses.

As you said, if we apply the several-ms delay, it applies whether the system is in the normal or the excessive-frequency state. That can have the consequences that 1) under the normal condition it produces worse QoS than without such a delay, and 2) under the excessive-frequency condition the mitigation effect of a 1ms delay may be too weak. In addition, your idea is to delay scheduling instead of reducing it, which means the total number of schedules would probably not change.

I think one possible solution is to make the 1ms delay value adaptive according to the system status (low load or high load). In that sense, the SRC patch just covers the excessive condition currently :). That's why I mentioned treating the normal and excessive conditions separately and influencing the normal case as little as possible -- we never know the consequences without a large amount of testing work. :)

Some of my stupid thinking :)

Best regards,

Lv, Hui


-----Original Message-----
From: George Dunlap [mailto:george.dunlap@citrix.com]
Sent: Saturday, October 29, 2011 12:19 AM
To: Dario Faggioli
Cc: Lv, Hui; George Dunlap; Duan, Jiangang; Tian, Kevin; xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie
Subject: RE: [Xen-devel] [PATCH] scheduler rate controller

[...]
On Sat, Oct 29, 2011 at 11:05 AM, Lv, Hui <hui.lv@intel.com> wrote:
> I have tried an approach very similar to your idea:
> 1) Check whether the currently running vcpu has run for less than 1ms; if so, return the current vcpu directly without preemption.
> It tries to guarantee that a vcpu can run for as long as 1ms if it wants to. It can reduce the scheduling frequency to some degree, but not very significantly, because 1ms is too weak compared with the 10ms window the SRC patch uses.

Hey Hui,

Sorry for the delay in response -- FYI I'm at the XenSummit Korea now, and I'll be on holiday next week.

Do you have the patch that you wrote for the 1ms delay handy, and any numbers that you ran? I'm a bit surprised that a 1ms delay didn't have much effect; but in any case, it seems dialing that up should have a similar effect -- e.g., if we changed that to 10ms, then it should have a similar effect to the patch that you sent before.

> As you said, if we apply the several-ms delay, it applies whether the system is in the normal or the excessive-frequency state. That can have the consequences that 1) under the normal condition it produces worse QoS than without such a delay,

Perhaps; but the current credit scheduler may already allow a VM to run exclusively for 30ms, so I don't think that overall it should have a big influence.

> 2) under the excessive-frequency condition the mitigation effect of a 1ms delay may be too weak. In addition, your idea is to delay scheduling instead of reducing it, which means the total number of schedules would probably not change.

Well, it will prevent preemption; so as long as at least one VM does not yield, it will reduce the number of schedule events to 1000 times per second. If all VMs yield, then you can't really reduce the number of scheduling events anyway (even with your preemption-disable patch).

> I think one possible solution is to make the 1ms delay value adaptive according to the system status (low load or high load). In that sense, the SRC patch just covers the excessive condition currently :). That's why I mentioned treating the normal and excessive conditions separately and influencing the normal case as little as possible -- we never know the consequences without a large amount of testing work. :)

Yes, exactly. :-)

> Some of my stupid thinking :)

Well, you've obviously done a lot more looking recently than I have. :-)

I'm attaching a prototype minimum timeslice patch that I threw together last week. It currently hangs during boot, but it will give you the idea of what I was thinking of.

Hui, can you let me know what you think of the idea, and if you find it interesting, could you try to fix it up, and test it? Testing it with bigger values like 5ms would be really interesting.

 -George

> [...]
> Hey Hui, Sorry for the delay in response -- FYI I'm at the XenSummit Korea
> now, and I'll be on holiday next week.

Have a good trip in Korea and enjoy the holiday that follows! And say hi to everyone there. :)

> I'm attaching a prototype minimum timeslice patch that I threw together last
> week. It currently hangs during boot, but it will give you the idea of what I
> was thinking of.
>
> Hui, can you let me know what you think of the idea, and if you find it
> interesting, could you try to fix it up, and test it? Testing it with bigger values
> like 5ms would be really interesting.

I agree that this idea seems more natural and proper, if it can solve the two problems I addressed above. We need data to prove or disprove it. As you mentioned, this method is supposed to give a similar result to the patch I sent when the delay value is set to 10ms under the excessive condition.

So an idea came to me that may strengthen your proposal:
1. We still count the number of schedules during each period (for example 10ms).
2. This count is used to adaptively decide the delay value. For example, if the number of schedules is very excessive, we can set a longer delay time, such as 5ms or 10ms. If the number is small, we can set a small delay time, such as 1ms, 500us or even zero. In this way, the delay value is decided adaptively.

I think it can solve the possible problems I addressed above. George, what do you think of this?

I'd like to try this and see the result, and maybe compare the results of the different solutions. As you know, the SPECvirt workload is complex enough that I need some time to produce this :). We also have a set of small workloads for quick testing.
On Fri, Nov 4, 2011 at 2:08 PM, Lv, Hui <hui.lv@intel.com> wrote:
> So an idea came to me that may strengthen your proposal:
> 1. We still count the number of schedules during each period (for example 10ms).
> 2. This count is used to adaptively decide the delay value.
> For example, if the number of schedules is very excessive, we can set a longer delay time, such as 5ms or 10ms. If the number is small, we can set a small delay time, such as 1ms, 500us or even zero. In this way, the delay value is decided adaptively.

Setting the value adaptively is good, but only if it's adapting to the right thing. :-) For instance, adapting to the number of cache misses, or to the latency requirements of guests, seems like a good idea. But adapting to the number of scheduling events in the last period doesn't seem very useful -- especially since our whole goal is to change the number of scheduling events to be fewer. :-)

> I'd like to try this and see the result, and maybe compare the results of the different solutions. As you know, the SPECvirt workload is complex enough that I need some time to produce this :).

I've heard that; thanks for doing the work.

> We also have a set of small workloads for quick testing.

What kinds of workloads are these? Our performance team here is also trying to develop a lighter-weight (i.e., easier to set up and run) benchmark for scalability / consolidation. Hopefully once they get that up and running I can test the scheduling characteristics as well.

Peace,
 -George
On Fri, Nov 4, 2011 at 2:08 PM, Lv, Hui <hui.lv@intel.com> wrote:
> I'd like to try this and see the result, and maybe compare the results of the different solutions. As you know, the SPECvirt workload is complex enough that I need some time to produce this :).
> We also have a set of small workloads for quick testing.

BTW, did you take any traces when running SPECvirt for your previous tests? If you could take some traces of SPECvirt whenever you get to doing the new tests, that would be really helpful in understanding both how the SPECvirt benchmark behaves, and how we can best tweak the scheduler so that it runs better.

Thanks,
 -George
Hi George,

Sorry for the late reply. I also met issues booting Xen with your patch, which is the same as the credit_3.patch that I attached. So I modified it into credit_1.patch and credit_2.patch, both of which work well.
1) credit_1 adopts "scheduling frequency counting" to decide the value of sched_ratelimit_us, which makes it adaptive.
2) credit_2 adopts a constant sched_ratelimit_us value of 1000.

Although the performance comparison data is still in progress, I want to hear some feedback from you first. I should be able to share the data very soon, once the system becomes stable.

Best regards,

Lv, Hui


-----Original Message-----
From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George Dunlap
Sent: Thursday, November 03, 2011 12:29 PM
To: Lv, Hui
Cc: George Dunlap; Dario Faggioli; Tian, Kevin; xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie; Duan, Jiangang
Subject: Re: [Xen-devel] [PATCH] scheduler rate controller

[...]
On Mon, Nov 28, 2011 at 5:31 PM, Lv, Hui <hui.lv@intel.com> wrote:
> Sorry for the late reply. I also met issues booting Xen with your patch, which is the same as the credit_3.patch that I attached.

Thanks Hui; debugging someone else's buggy code is going beyond expectations. :-)

> So I modified it into credit_1.patch and credit_2.patch, both of which work well.

Standard patches for OSS development need to be -p1, not -p0; they need to work if, in the toplevel of the directory (foo.hg), you type "patch -p1 < bar.patch". The easiest way to make one of these patches is to use "hg diff"; the best way (if you're using Mercurial) is to use Mercurial queues.

> 2) credit_2 adopts a constant sched_ratelimit_us value of 1000.

Looks fine overall. One issue with the patch is that it will not only fail to preempt for a higher-priority vcpu, it will also fail to preempt for tasklets. Tasklet work must be done immediately. Perhaps we can add "!tasklet_work_scheduled" to the list of conditions for taking the ratelimit path.

Why did you change "MICROSECS" to "MILLISECS" when calculating the timeslice? In this case, it will set the timeslice to a full second! Not what we want...

From a software maintenance perspective, I'm not a fan of early returns from functions. I think it's too easy not to notice that there's a different return path. In this case, I think I'd prefer adding a label, and using "goto out;" instead.

> 1) credit_1 adopts "scheduling frequency counting" to decide the value of sched_ratelimit_us, which makes it adaptive.

If you were using mercurial queues, you could put this after the last one, and it would be easier to see the proposed "adaptive" part of the code. :-)

Hypervisors are very complicated; it's best to keep things as absolutely simple as possible. This kind of mechanism is exactly the sort of thing that makes it very hard to predict what will happen. I think unless you can show that it adds a significant benefit, it's better just to use the min timeslice.

Regarding this particular code, a couple of items, just for feedback:
* All of the ratelimiting code and data structures should be in the pluggable scheduler, not in common code.
* This code hard-codes '1000' as the value it sets the global variable to, overriding whatever the user may have entered on the command line.
* Furthermore, the global variable is shared by all of the cpus; meaning, you may have one cpu enabling it one moment based on its own conditions, and another cpu disabling it almost immediately afterwards, based on conditions on that cpu. If you're testing with this at the moment, you might as well stop -- you're going to get a pretty random result. If you really wanted this to be an adaptive solution, you'd need to make a per-cpu variable with the per-cpu rate scheduling; and then set it to the global variable (which is the user configuration).
* Finally, this patch doesn't distinguish between schedules that happen due to a yield, and schedules that happen due to preemption. The only schedules we have any control over are schedules that happen due to preemption. If adaptivity has any value, it should pay attention to what it can control.

I've taken your two patches, given them the proper formatting, and made them into a patch series (the first adding the ratelimiting, the second adding the adaptive bit); they are attached. You should be able to easily pull them into a mercurial patchqueue using "hg qimport /path/to/patches/*.diff". In the future, I will not look at any patches which do not apply using either "patch -p1" or "hg qimport."

Thanks again for all your work on this -- I hope we can get a simple, effective solution in place soon.

 -George
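To make the shape of that feedback concrete, the rate-limit check inside credit1's csched_schedule() would presumably end up looking something like the sketch below. This illustrates the review points (tasklet check, MICROSECS rather than MILLISECS, a single exit via a label) and is not the contents of the attached patches; prv->ratelimit_us, scurr, runtime, tslice and the out label are assumed to exist in the surrounding function.

    /* Early in csched_schedule(), once 'runtime' for the current vcpu is known: */
    if ( !tasklet_work_scheduled                     /* tasklet work must still preempt */
         && prv->ratelimit_us                        /* 0 disables rate limiting */
         && vcpu_runnable(current)
         && !is_idle_vcpu(current)
         && runtime < MICROSECS(prv->ratelimit_us) ) /* MICROSECS, not MILLISECS */
    {
        snext = scurr;                               /* keep running the current vcpu */
        tslice = MICROSECS(prv->ratelimit_us) - runtime;
        goto out;                                    /* one exit path, no early return */
    }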
Hi George,

Thank you very much for your feedback. I have finished the measurement work based on the delay method. From the comparable results, a 1ms delay can do as well as the SRC patch: a significant performance boost without obvious drawbacks.

1. Basically, the "delay method" can achieve nearly the same benefits as my previous SRC patch: an 11% overall performance boost for SPECvirt over the original credit scheduler.
2. I have tried a 1ms delay and a 10ms delay; there is no big difference between these two configurations. (1ms is enough to achieve good performance.)
3. I have compared the response time/latency at different load levels (low, high, peak); the "delay method" didn't bring much QoS increase.
4. The 1ms delay can reduce context switches by 30% at peak performance, which is where the benefits come from.

You can find the raw data in the attached Excel file. The attached credit_1.diff patch works stably on my side.

> Looks fine overall. One issue with the patch is that it will not only fail to
> preempt for a higher-priority vcpu, it will also fail to preempt for tasklets.
> Tasklet work must be done immediately. Perhaps we can add
> "!tasklet_work_scheduled" to the list of conditions for taking the ratelimit path.

Yes, I added "!tasklet_work_scheduled". I have run experiments comparing with and without it; there is no big difference.

> Why did you change "MICROSECS" to "MILLISECS" when calculating the timeslice?
> In this case, it will set the timeslice to a full second!
> Not what we want...

Sorry, it was my typo; I have changed it.

> From a software maintenance perspective, I'm not a fan of early returns from
> functions. I think it's too easy not to notice that there's a different return
> path. In this case, I think I'd prefer adding a label, and using "goto out;"
> instead.

Followed this code style.

> If you were using mercurial queues, you could put this after the last one, and it
> would be easier to see the proposed "adaptive" part of the code. :-)
>
> Hypervisors are very complicated; it's best to keep things as absolutely simple
> as possible. This kind of mechanism is exactly the sort of thing that makes it
> very hard to predict what will happen. I think unless you can show that it adds
> a significant benefit, it's better just to use the min timeslice.

In fact, I have tested with a 1ms delay and a 10ms delay, and there is no significant performance improvement between them. That means that even if we chose the adaptive method, there would be no significant performance boost compared with 1ms. 1ms is good enough so far.

> Regarding this particular code, a couple of items, just for feedback:
> * All of the ratelimiting code and data structures should be in the pluggable
> scheduler, not in common code.

Agreed.

> If you really wanted this to be an adaptive solution, you'd need to make a per-cpu
> variable with the per-cpu rate scheduling; and then set it to the global variable
> (which is the user configuration).

It seems there is no need to make it adaptive now. :)

> I've taken your two patches, given them the proper formatting, and made them
> into a patch series (the first adding the ratelimiting, the second adding the
> adaptive bit); they are attached. You should be able to easily pull them into a
> mercurial patchqueue using "hg qimport /path/to/patches/*.diff". In the
> future, I will not look at any patches which do not apply using either "patch -p1"
> or "hg qimport."

Thanks for the coaching on submitting patches the right way.

> -George
On Sun, Dec 11, 2011 at 3:27 PM, Lv, Hui <hui.lv@intel.com> wrote:
> Hi George,
>
> Thank you very much for your feedback. I have finished the measurement work based on the delay method. From the comparable results, a 1ms delay can do as well as the SRC patch: a significant performance boost without obvious drawbacks.
> 1. Basically, the "delay method" can achieve nearly the same benefits as my previous SRC patch: an 11% overall performance boost for SPECvirt over the original credit scheduler.
> 2. I have tried a 1ms delay and a 10ms delay; there is no big difference between these two configurations. (1ms is enough to achieve good performance.)
> 3. I have compared the response time/latency at different load levels (low, high, peak); the "delay method" didn't bring much QoS increase.

Thanks Hui, those are good results. Just one question: What's QoS supposed to measure? Is this a metric that SPECvirt reports? Is higher or lower better?

The patch looks good, but there's one last nitpick: Several of your lines have hard tab characters in them; tabs are officially verboten in the Xen code. Please replace them with spaces. After that, I think we're ready to check it in.

One more small request: would it be possible to get some short xen traces of SPECvirt running, at least with the 1ms-delay patch, and if possible without it? I'd like to have a better understanding of what kind of scheduling workload SPECvirt creates, and how the 1ms delay affects it. If you have other priorities, don't worry, I'll wait until our performance team here gets it set up. If you do have time, the command to use is as follows:

xentrace -D -e 0x21000 -T 30 /path/to/file.trace

This will take *just* scheduling traces for 30 seconds. If you could run it when the benchmark is going full throttle, that should help me get an idea what the scheduling looks like.

Thanks,
 -George
> Thanks Hui, those are good results. Just one question: What's QoS
> supposed to measure? Is this a metric that SPECvirt reports? Is
> higher or lower better?

Yes, it is reported by SPECvirt. The lower, the better.

> The patch looks good, but there's one last nitpick: Several of your
> lines have hard tab characters in them; tabs are officially verboten
> in the Xen code. Please replace them with spaces. After that, I
> think we're ready to check it in.

Great!!

> the command to use is as follows:
>
> xentrace -D -e 0x21000 -T 30 /path/to/file.trace
>
> This will take *just* scheduling traces for 30 seconds. If you could
> run it when the benchmark is going full throttle, that should help me
> get an idea what the scheduling looks like.

Yes, I'd like to do this. I hope the data size is not too large: at full throttle the Xen hypervisor takes a large share of CPU time (around 20%), so the trace data gets very big. I'll let you know when the data is ready.