George Dunlap
2009-Apr-09 15:58 UTC
[Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
In the interest of openness (as well as in the interest of taking advantage of all the smart people out there), I'm posting a very early design prototype of the credit2 scheduler. We've had a lot of contributors to the scheduler recently, so I hope that those with interest and knowledge will take a look and let me know what they think at a high level.

This first e-mail will discuss the overall goals: the target "sweet spot" use cases to consider, measurable goals for the scheduler, and the target interface / features. This is for general comment.

The subsequent e-mail(s?) will include some specific algorithms and changes currently in consideration, as well as some bleeding-edge patches. This will be for people who have a specific interest in the details of the scheduling algorithms.

Please feel free to comment / discuss / suggest improvements.

1. Design targets

We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).

For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.

For client systems, we expect to have 3-4 VMs (including dom0). Systems will probably have a single socket with 2 cores and SMT (4 logical cpus). Many VMs will be using PCI pass-through to access network, video, and audio cards. They'll also be running video and audio workloads, which are extremely latency-sensitive.

2. Design goals

For each of the target systems and workloads above, we have some high-level goals for the scheduler:

* Fairness. In this context, we define "fairness" as the ability to get cpu time proportional to weight.

We want to try to make this true even for latency-sensitive workloads such as networking, where long scheduling latency can reduce the throughput, and thus the total amount of time the VM can effectively use.

* Good scheduling for latency-sensitive workloads.

To the degree we are able, we want this to be true even for those which use a significant amount of cpu power: That is, my audio shouldn't break up if I start a cpu hog process in the VM playing the audio.

* HT-aware.

Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

* Power-aware.

Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

3. Target interface:

The target interface will be similar to credit1:

* The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).

* Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.

For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

* The "cap" functionality of credit1 will be retained.

This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.

* We will also have an interface to the cpu-vs-electrical power.

This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.
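To make the proposed interface a little more concrete, here is a minimal sketch in C of what the per-VM knobs might look like and how the weight example above works out. The struct and function names are purely illustrative assumptions, not an actual Xen interface:

    #include <stdio.h>

    /* Hypothetical per-VM scheduling parameters (names are assumptions). */
    struct sched_params {
        int weight;       /* proportional share, e.g. default 256 */
        int reservation;  /* minimum % of one cpu, 0 = none       */
        int cap;          /* maximum % of one cpu, 0 = uncapped   */
    };

    /* Weight-proportional split of one cpu between two competing cpu-hog
     * VMs, with the cap applied afterwards. */
    static double share_of_cpu(const struct sched_params *me,
                               const struct sched_params *other)
    {
        double s = 100.0 * me->weight / (me->weight + other->weight);
        if (me->cap && s > me->cap)
            s = me->cap;
        return s;
    }

    int main(void)
    {
        struct sched_params a = { 256, 0, 0 }, b = { 512, 0, 0 };
        /* Prints roughly 33% and 67%, matching the example above. */
        printf("A: %.0f%%  B: %.0f%%\n", share_of_cpu(&a, &b), share_of_cpu(&b, &a));
        return 0;
    }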
Jeremy Fitzhardinge
2009-Apr-09 18:41 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
George Dunlap wrote:

> 1. Design targets
>
> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>
> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?

> For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.
>
> For client systems, we expect to have 3-4 VMs (including dom0). Systems will probably have a single socket with 2 cores and SMT (4 logical cpus). Many VMs will be using PCI pass-through to access network, video, and audio cards. They'll also be running video and audio workloads, which are extremely latency-sensitive.
>
> 2. Design goals
>
> For each of the target systems and workloads above, we have some high-level goals for the scheduler:
>
> * Fairness. In this context, we define "fairness" as the ability to get cpu time proportional to weight.
>
> We want to try to make this true even for latency-sensitive workloads such as networking, where long scheduling latency can reduce the throughput, and thus the total amount of time the VM can effectively use.
>
> * Good scheduling for latency-sensitive workloads.
>
> To the degree we are able, we want this to be true even for those which use a significant amount of cpu power: That is, my audio shouldn't break up if I start a cpu hog process in the VM playing the audio.
>
> * HT-aware.
>
> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

Would it be worth just pair-scheduling HT threads so they're always running in the same domain?

> * Power-aware.
>
> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?

Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).

> 3. Target interface:
>
> The target interface will be similar to credit1:
>
> * The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).
>
> * Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.
>
> For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

How does the reservation interact with the credits? Is the reservation in addition to its credits, or does using the reservation consume them?

> * The "cap" functionality of credit1 will be retained.
>
> This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.
>
> * We will also have an interface to the cpu-vs-electrical power.
>
> This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.

Is it worth taking into account the power cost of cache misses vs hits?

Do vcpus running on pcpus running at less than 100% speed consume fewer credits?

Is there any explicit interface to cpu power state management, or would that be decoupled?

J
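To make the "directed schedule" idea slightly more concrete, here is a toy sketch of a yield-to primitive; everything in it (types, fields, the function itself) is a made-up illustration, not an existing Xen interface:

    /* Toy model of a directed yield: when the current vcpu knows it is about
     * to block waiting on another vcpu, hand the physical cpu straight to
     * that vcpu so it runs on the still-hot cache instead of being woken on
     * a remote pcpu.  All names here are hypothetical. */

    struct toy_vcpu {
        int runnable;
        struct toy_vcpu *runq_next;   /* simplified local run queue link */
    };

    struct toy_pcpu {
        struct toy_vcpu *current;
        struct toy_vcpu *runq;        /* head of the local run queue     */
    };

    static void yield_to(struct toy_pcpu *p, struct toy_vcpu *target)
    {
        if (target && target->runnable)
            p->current = target;      /* run the waited-on vcpu here, now */
        else
            p->current = p->runq;     /* otherwise fall back to the runq  */
    }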
Tian, Kevin
2009-Apr-10 00:15 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: George Dunlap
>Sent: 9 April 2009 23:59
>
>For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

How were 80%/800% chosen here?

>For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.

What total number of VMs would you like to support?

>* HT-aware.
>
>Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

Do you mean that the same elapsed time in the above two scenarios will be translated into different credits?

>* Power-aware.
>
>Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

Xen 3.4 now supports "sched_smt_power_savings" (both a boot option and touchable by xenpm) to change the power/performance preference. It's a simple implementation which just reverses the span order from the existing package->core->thread to thread->core->package. More fine-grained flexibility could be given in the future if a hierarchical scheduling concept were constructed more clearly, like the domain scheduler in Linux.

Another possible 'fairness' point affected by power management could be to take frequency scaling into consideration, since credit so far is simply calculated from elapsed time, while the same elapsed time at different frequencies actually represents a different number of consumed cycles.

>3. Target interface:
>
>The target interface will be similar to credit1:
>
>* The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).

IMO, weight does not translate strictly into care for latency; any elaboration on that? I remember that Nishiguchi-san previously gave an idea to boost credit, and Disheng proposed static priority. Maybe you can make a summary to help people see how latency would be ensured in your proposal.

>* Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.

This is a good idea.

>For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

Should there be some way to adjust or limit use of the 'reservation' when multiple vcpus claim reservations which together exceed the cpu's computing power, or which weaken your general 'weight-as-basic-unit' idea?

>* The "cap" functionality of credit1 will be retained.
>
>This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.
>
>* We will also have an interface to the cpu-vs-electrical power.
>
>This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.

I'm not sure how that number will be defined. Maybe we can follow the current way and just add individual name-based options matching the purpose (such as migration_cost and sched_smt_power_savings...).

Thanks,
Kevin
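As a rough illustration of that frequency-scaling point (the credit code today burns credit purely on elapsed time; the names and units below are assumptions), the charge could be scaled by the delivered frequency:

    #include <stdint.h>

    struct vcpu_acct {
        int64_t credit;    /* remaining credit, arbitrary units */
    };

    /* elapsed_ns: time the vcpu just ran; cur_khz/max_khz: current and
     * nominal P-state frequency of the pcpu it ran on.  A vcpu run at half
     * clock is charged half as much, approximating "consumed cycles". */
    static void burn_credit_scaled(struct vcpu_acct *v, int64_t elapsed_ns,
                                   uint32_t cur_khz, uint32_t max_khz)
    {
        v->credit -= elapsed_ns * cur_khz / max_khz;
    }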
Tian, Kevin
2009-Apr-10 00:33 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge
>Sent: 10 April 2009 2:42
>
>George Dunlap wrote:
>> 1. Design targets
>>
>> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>>
>> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).
>
>Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?

Good point.

>> * HT-aware.
>>
>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>
>Would it be worth just pair-scheduling HT threads so they're always running in the same domain?

Running the same domain doesn't help fairness; instead, it makes it worse.

>> * Power-aware.
>>
>> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.
>
>I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?

This is really desired.

>Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).

The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)

>> * We will also have an interface to the cpu-vs-electrical power.
>>
>> This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.
>
>Is it worth taking into account the power cost of cache misses vs hits?
>
>Do vcpus running on pcpus running at less than 100% speed consume fewer credits?
>
>Is there any explicit interface to cpu power state management, or would that be decoupled?

Right now cpu power management has a sysctl interface exposed, and xenpm is the tool using that interface so far.

Thanks,
Kevin
Zhiyuan Shao
2009-Apr-10 02:28 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Hi all,

Actually, I think I/O responsiveness is important for the scheduling algorithm to control. This is especially true for a virtualized desktop/client environment, since in such environments there are so many I/O events to handle, which is different from the server consolidation case, where many of the tasks are CPU-intensive.

I would like to illustrate this point with a simple scheduling algorithm, attached to this mail, which I wrote last winter (Jan. 2009). At that time I was visiting Intel OTC; thanks to the Intel guys (Disheng, Kevin Tian, etc.) for their help.

The scheduler is named SDP (you have to use the "sched=sdp" parameter on the Xen kernel line when booting); the intent is to use ideas of dynamic priority to make virtualized clients meet their usage needs. However, this scheduler is basically a simple prototype for proving the idea so far; I have not implemented the dynamic priority mechanisms yet. The solution used in this simple scheduler is largely ad hoc, and I hope it can contribute something to the future development of the next-generation Xen scheduler. BTW, I borrowed a large portion of code from the Credit scheduler.

This patch should work well on a VT-d platform (it does not do well in the emulated-device environment, since device emulation, especially of video, results in too much overhead for the scheduler to handle). We (thanks again to Intel OTC for the VT-d platform) tested the scheduler on a 3.0GHz 2-core system, invoked 2 HVM guest domains (one primary and one auxiliary, both with two VCPUs), and pinned each vcpu of each domain to a different PCPU (the VCPUs of domain0 are pinned as well), since I still have not implemented a proper VCPU migration mechanism in SDP (sorry for that; I do not think the aggressive migration mechanism of Credit is right for virtualized clients, and I am working on this now and hope to find a proper one for Xen in the near future). The sound and video cards are directly assigned to the primary domain, while the auxiliary domain uses emulated ones.

Set the "priority" value (it should really be named "I/O responsiveness"; I think I made a mistake there, since the initial objective was to use dynamic priority ideas) of domain0 to 91, and that of the primary domain to 90. Regarding the auxiliary domain, you can leave it at the default (80). Please use the attached domain0 command-line tool (i.e., sched_sdp_set [domain_id] [new_priority]) to set the new priority; I am not good at Python, sorry for that!

We tested a scenario in which the primary domain plays a DivX video while, at the same time, very big files are copied in the auxiliary domain. The video plays well! The effectiveness we experienced beats BCredit in this use case, no matter how we adjust BCredit's parameters.

Some explanations of SDP: the "priority" parameter is actually used to control I/O responsiveness. If a VCPU is woken by an I/O event at runtime, and its "priority" value happens to be higher than that of the current VCPU, the current VCPU will be preempted and the woken VCPU is scheduled. A "bonus" is given to the woken VCPU to leave it as much time as possible to complete its I/O handling. The bonus value is computed by subtracting the "priority" parameters of the two (the woken and the current) VCPUs.
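For illustration, that wake-up rule might look roughly like this in C (the structure and field names below are invented for this sketch, not taken from the attached patch):

    struct sdp_vcpu {
        int prio;    /* the per-domain "priority" / I/O-responsiveness value */
        int bonus;   /* extra time granted on wake-up, in scheduler ticks    */
    };

    /* Called when an I/O event wakes 'woken' while 'curr' is running on the
     * same pcpu.  Returns 1 if the scheduler should preempt 'curr' now. */
    static int sdp_wake_preempt(struct sdp_vcpu *woken, struct sdp_vcpu *curr)
    {
        if (woken->prio <= curr->prio)
            return 0;    /* a lower-priority wake-up never preempts */

        /* e.g. dom0 (91) waking over the auxiliary domain (80) gets 11. */
        woken->bonus = woken->prio - curr->prio;
        return 1;
    }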
This strategy actually inhibits preemption of a currently running VCPU with a high "priority" by another VCPU with a lower one, while permitting preemption the other way around, and I think this method fits the asymmetric domain-role scenario of virtualized clients and desktops well.

Regarding computation resource allocation, the simple scheduler actually shares the CPU in a round-robin fashion. An I/O event arriving at a high-"priority" VCPU gives it a little "bonus"; after using that up, the VCPU falls back to the round-robin scheduling ring. In this way it maintains some kind of fairness even in a virtualized client environment; e.g., in our testing scenario we found that the file copying in the auxiliary domain proceeds well (although a little slowly) while the primary domain plays a DivX video, which generates a high volume of I/O events.

From this experience, I think I/O responsiveness is an important parameter to add in the development of the new scheduler, since platforms have their own performance metrics, and the user can adjust the I/O responsiveness parameter of the domains to make them work well.

Moreover, I think some characteristics of the Credit scheduler do not fit virtualized clients/desktops well (for further discussion if possible). If used in the virtualized client/desktop scenario, the worst aspect of Credit is its small state space for marking VCPUs (i.e., BOOST, UNDER and OVER). This makes it very inconvenient, at a minimum, to differentiate the VCPUs of different domains and with different kinds of tasks, although the small state space does work well in consolidated servers. The second inconvenient aspect of the Credit scheduler is the method by which it "boosts" VCPUs. In the original Credit, a woken VCPU has to have enough credits (UNDER state) to be promoted to the BOOST state. However, the domain (VCPU) may have used up its credit while it still has a critical task; at that moment fairness is a secondary consideration, and should be maintained in later phases. BCredit brings some changes here, although BCredit may unfortunately give fairness little consideration.

Thanks,

2009-04-10

Zhiyuan Shao
Jeremy Fitzhardinge
2009-Apr-10 16:15 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Tian, Kevin wrote:

>> From: Jeremy Fitzhardinge
>> Sent: 10 April 2009 2:42
>>
>> George Dunlap wrote:
>>> 1. Design targets
>>>
>>> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>>>
>>> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).
>>
>> Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?
>
> Good point.
>
>>> * HT-aware.
>>>
>>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>>
>> Would it be worth just pair-scheduling HT threads so they're always running in the same domain?
>
> Running the same domain doesn't help fairness; instead, it makes it worse.

I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.

If we present them as sibling pairs to guests, then it becomes the guest OS's problem (ie, we don't try to hide the true nature of these pcpus). That's fairer for the guest, because they know what they're getting, and Xen can charge the guest for cpu use on a thread-pair, rather than trying to work out how the two threads compete. In other words, if only one thread is running, then it can charge max-thread-throughput; if both are running, it can charge max-core-throughput (possibly scaled by whatever performance mode the core is running in).

>>> * Power-aware.
>>>
>>> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.
>>
>> I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?
>
> This is really desired.
>
>> Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).
>
> The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)

I'm being unclear by conflating two issues.

One is that when dom0 (or a driver domain) does some work on behalf of a guest, it seems like it would be useful for the time used to be credited against the guest rather than against dom0.

My thought is that, rather than having the scheduler parameters be the implicit result of "vcpu A belongs to domain X, charge X", each vcpu has a charging domain which can be updated via (privileged) hypercall. When dom0 is about to do some work, it updates the charging domain accordingly (with some machinery to make that a per-task property within the kernel so that task context switches update the vcpu state appropriately).

A further extension would be the idea of charging grants, where domain A could grant domain B charging rights, and B could set its vcpus to charge A as an unprivileged operation. As with grant tables, revocation poses some interesting problems.

This is a generalization of coscheduled stub domains, because you could achieve the same effect by making the stub domain simply switch all its vcpus to charge its main domain.

How to schedule vcpus? They could either be scheduled as if they were part of the other domain; or be scheduled with their "home" domain, but their time spent is charged against the other domain. The former is effectively priority inheritance, and raises all the normal issues - but it would be appropriate for co-scheduled stub domains. The latter makes more sense for dom0, but it's less clear what it actually means: does it consume any home domain credits? What happens if the other domain's credits are all consumed? Could two domains collude to get more than their fair share of cpu?

The second issue is trying to share pcpu resources between vcpus where appropriate. The obvious case is doing some kind of cross-domain copy operation, where the data could well be hot in cache, so if you use the same pcpu you can just get cache hits. Of course there's the tradeoff that you're necessarily serialising things which could be done in parallel, so perhaps it doesn't work well in practice.

J
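A minimal sketch of the charging-domain idea above, purely as an assumption of how it could be structured (no such hypercall or fields exist today):

    struct domain_acct {
        long credit;                 /* credit pool of one domain          */
    };

    struct vcpu_charge {
        struct domain_acct *home;    /* the domain this vcpu belongs to    */
        struct domain_acct *charge;  /* who currently pays for its time    */
    };

    /* Privileged operation: charge 'target' for this vcpu's time until
     * reset; passing NULL reverts to charging the home domain. */
    static void set_charge_domain(struct vcpu_charge *v, struct domain_acct *target)
    {
        v->charge = target ? target : v->home;
    }

    /* Accounting path: debit whoever is currently being charged. */
    static void account_time(struct vcpu_charge *v, long consumed)
    {
        v->charge->credit -= consumed;
    }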
Ian Pratt
2009-Apr-10 17:16 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.

The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.

Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.

Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.

Possibly having two modes of operation would be a good thing:

 1. explicitly present HT to guests and gang schedule threads

 2. normal free-for-all with HT-aware accounting.

Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

Ian
Jeremy Fitzhardinge
2009-Apr-10 17:19 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Ian Pratt wrote:

>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
> The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>
> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>
> Possibly having two modes of operation would be a good thing:
>
> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

J
Jeremy Fitzhardinge
2009-Apr-10 17:34 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Ian Pratt wrote:

> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

Well, we could extend vcpu hotplug to deal with those kinds of cpu topology changes, but I guess that doesn't help most Windows/hvm guests. But if those vcpus stop being siblings, I don't think it would hurt if we stopped gang scheduling them, so long as they're kept close (same package, I guess).

J
Tian, Kevin
2009-Apr-11 09:52 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>Sent: 11 April 2009 0:15
>
>>>> * HT-aware.
>>>>
>>>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>>>
>>> Would it be worth just pair-scheduling HT threads so they're always running in the same domain?
>>
>> Running the same domain doesn't help fairness; instead, it makes it worse.
>
>I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
>If we present them as sibling pairs to guests, then it becomes the guest OS's problem (ie, we don't try to hide the true nature of these pcpus). That's fairer for the guest, because they know what they're getting, and Xen can charge the guest for cpu use on a thread-pair, rather than trying to work out how the two threads compete. In other words, if only one thread is running, then it can charge max-thread-throughput; if both are running, it can charge max-core-throughput (possibly scaled by whatever performance mode the core is running in).

This is based on the assumption that workloads within a VM are more HT-friendly than workloads across VMs. Maybe that's true in some cases, but I don't think it's a strong point in most deployments.

The major worry to me is the complexity added by exposing such sibling pairs to the guest. You then have to schedule at core level for that VM, since the implication of HT must always be maintained, or else the reverse effect could be seen when the VM does try to use that topology. This brings trouble to the scheduler. Not all VMs are guest SMP, and then the VM exposed with HT is actually treated unfairly, as one more limitation is imposed on it: a partially idle core can't be used by it, while other VMs are immune. Another tricky part is that you have to gang schedule that VM, which is fancy in concept, but no one has come up with a solid implementation in reality.

The above is why I said fairness could be worse at a general level. It could be useful in some specific scenarios: one is the client, where, however, it's better to expose the full topology instead of just HT; the other is some mission-critical usages where cpu resources are partitioned, and there exposing HT could also be useful.

>>> Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).
>>
>> The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)
>
>I'm being unclear by conflating two issues.
>
>One is that when dom0 (or a driver domain) does some work on behalf of a guest, it seems like it would be useful for the time used to be credited against the guest rather than against dom0.
>
>My thought is that, rather than having the scheduler parameters be the implicit result of "vcpu A belongs to domain X, charge X", each vcpu has a charging domain which can be updated via (privileged) hypercall. When dom0 is about to do some work, it updates the charging domain accordingly (with some machinery to make that a per-task property within the kernel so that task context switches update the vcpu state appropriately).
>
>A further extension would be the idea of charging grants, where domain A could grant domain B charging rights, and B could set its vcpus to charge A as an unprivileged operation. As with grant tables, revocation poses some interesting problems.
>
>This is a generalization of coscheduled stub domains, because you could achieve the same effect by making the stub domain simply switch all its vcpus to charge its main domain.

Yup. This is one long-missing part in Xen. The current accounting mechanism, as in xentop, is incomplete. In this part KVM could have it easier under the cap of the container.
Tian, Kevin
2009-Apr-11 09:57 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 11 April 2009 1:16
>
>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
>The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
>Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>
>Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>
>Possibly having two modes of operation would be a good thing:
>
> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
>Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

What do you mean by 'free-for-all'?

Thanks,
Kevin
Tian, Kevin
2009-Apr-11 10:00 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>Sent: 11 April 2009 1:20
>
>Ian Pratt wrote:
>>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>>
>> The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>>
>> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>>
>> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>>
>> Possibly having two modes of operation would be a good thing:
>>
>> 1. explicitly present HT to guests and gang schedule threads
>>
>> 2. normal free-for-all with HT-aware accounting.
>>
>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>
>This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

The normal name for this is Turbo Boost. However, it'd be difficult for software to account for the extra cycles gained from the overclock, as whether boost actually happens and how many cycles can be boosted are completely controlled by the hardware unit. There is some feedback mechanism, though, to obtain the average frequency over an elapsed interval. However, currently the cpufreq governor runs in a time-based style without any connection to the scheduler. That's one part we could further enhance.

Thanks,
Kevin
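The feedback mechanism mentioned here is presumably the APERF/MPERF MSR pair (MPERF ticks at the nominal frequency, APERF at the delivered frequency, so the ratio over an interval gives the average effective speed, including Turbo). A sketch, assuming a per-pcpu rdmsr() helper:

    #include <stdint.h>

    #define MSR_IA32_MPERF 0xE7
    #define MSR_IA32_APERF 0xE8

    extern uint64_t rdmsr(uint32_t msr);   /* assumed MSR-read helper */

    /* Average delivered/nominal speed, in permille, since the last snapshot.
     * A value like this could scale credit burn, as discussed above. */
    static uint32_t effective_speed_permille(uint64_t *prev_aperf,
                                             uint64_t *prev_mperf)
    {
        uint64_t a = rdmsr(MSR_IA32_APERF), m = rdmsr(MSR_IA32_MPERF);
        uint64_t da = a - *prev_aperf, dm = m - *prev_mperf;

        *prev_aperf = a;
        *prev_mperf = m;
        return dm ? (uint32_t)(da * 1000 / dm) : 1000;
    }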
Ian Pratt
2009-Apr-11 17:11 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>> Possibly having two modes of operation would be a good thing:
>>
>> 1. explicitly present HT to guests and gang schedule threads
>>
>> 2. normal free-for-all with HT-aware accounting.
>>
>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>
> What do you mean by 'free-for-all'?

Same as today, i.e. we don't gang schedule and all threads are available for running VCPUs.

I think it's reasonable to have two different modes of operation. For some CPU-intensive server virtualization-type workloads the admin basically wants to partition the machine. In this situation it's reasonable to expose the physical topology to guests (not just hyperthreads, but potentially cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too).

For more general virtualization workloads where the total number of VCPUs is rather greater than the number of physical CPUs, the current behaviour is preferable. HT-aware accounting will mean that VCPUs that run concurrently on the same core will be charged less than the full period they are scheduled for.

Thanks,
Ian
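As a sketch of what that HT-aware accounting could look like (the 70% factor is just a placeholder assumption for whatever per-thread throughput measurement suggests):

    #include <stdint.h>

    struct ht_vcpu {
        int64_t credit;
    };

    /* Charge full price for time the vcpu had the core to itself, and only a
     * fraction for time its sibling thread was also busy. */
    static void burn_ht_aware(struct ht_vcpu *v, int64_t ns_sibling_idle,
                              int64_t ns_sibling_busy)
    {
        v->credit -= ns_sibling_idle;
        v->credit -= ns_sibling_busy * 70 / 100;
    }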
Tian, Kevin
2009-Apr-12 06:27 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 12 April 2009 1:12
>
>>> Possibly having two modes of operation would be a good thing:
>>>
>>> 1. explicitly present HT to guests and gang schedule threads
>>>
>>> 2. normal free-for-all with HT-aware accounting.
>>>
>>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>>
>> What do you mean by 'free-for-all'?
>
>Same as today, i.e. we don't gang schedule and all threads are available for running VCPUs.
>
>I think it's reasonable to have two different modes of operation. For some CPU-intensive server virtualization-type workloads the admin basically wants to partition the machine. In this situation it's reasonable to expose the physical topology to guests (not just hyperthreads, but potentially cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too).
>
>For more general virtualization workloads where the total number of VCPUs is rather greater than the number of physical CPUs, the current behaviour is preferable. HT-aware accounting will mean that VCPUs that run concurrently on the same core will be charged less than the full period they are scheduled for.

Agree.

Thanks,
Kevin
George Dunlap
2009-Apr-15 13:54 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

Jeremy, did you mean we could expose an entire socket to a guest VM, so that it could schedule so as to take advantage of the effects of Turbo Boost, just as we can expose thread pairs to a VM and let the guest OS scheduler deal with threading issues?

-George
George Dunlap
2009-Apr-15 14:29 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 9, 2009 at 7:41 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:> I don''t remember if there''s a proper term for this, but what about having > multiple domains sharing the same scheduling context, so that a stub domain > can be co-scheduled with its main domain, rather than having them treated > separately?I think it''s been informally called "co-scheduling". :-) Yes, I''m going to be looking into that. One of the things that makes it a little less easy is that (as I understand it) there is only one stub domain "vcpu" per VM, which is shared by all a VM''s vcpus.> Also, a somewhat related point, some kind of directed schedule so that when > one vcpu is synchronously waiting on anohter vcpu, have it directly hand > over its pcpu to avoid any cross-cpu overhead (including the ability to take > advantage of directly using hot cache lines). That would be useful for > intra-domain IPIs, etc, but also inter-domain context switches > (domain<->stub, frontend<->backend, etc).The only problem is if the "service" domain has other work that it may do after it''s done. In my tests on a 2-core box doing scp to an HVM guest, it''s faster if I pin dom0 and domU to separate cores than if I pin them to the same core. Looking at the traces, it seems as though after dom0 has woken up domU, it spends another 50K cycles or so before blocking. Stub domains may behave differently; in any case, it''s something that needs experimentation to decide.>> For example, one could give dom0 a "reservation" of 50%, but leave the >> weight at 256. No matter how many other VMs run with a weight of 256, >> dom0 will be guaranteed to get 50% of one cpu if it wants it. >> > > How does the reservation interact with the credits? Is the reservtion in > addition to its credits, or does using the reservation consume them?I think your question is, how does the reservation interact with weight? (Credits is the mechanism to implement both.) The idea is that a VM would get either an amount of cpu proportional to its weight, or the reservation, whichever is greater. So suppose that VMs A, B, and C have weights of 256 on a system with 1 core, no reservations. If A and B are burning as much cpu as they can and C is idle, then A and B should get 50% each. If all of them (A,B,C) are burning as much cpu as they can, they will should 33% each. Now suppose that we give B a reservation of 40%. If A and B are burning as much as they can and C is idle, then A and B should again get 50% each. However, if all of them are burning as much as they can, then B should get 40% (its reservation), and A and C should each get 30% (i.e., the remaining 60% divided by weight). Does that make sense?> Is it worth taking into account the power cost of cache misses vs hits?If we have a general framework for "goodness" and "badness", and we have a way of measuring cache hits / misses, we should be able to extend the scheduler to do so.> Do vcpus running on pcpus running at less than 100% speed consume fewer > credits?Yes, we''ll also need to account for cpu frequency states in our accounting.> Is there any explicit interface to cpu power state management, or would that > be decoupled?I think we might be able to fold this in; it depends on how complicated things get. 
Just as one can imagine a "badness factor" for powering up a second CPU, which we can weigh against the "badness" of vcpus waiting on the runqueue, we can imagine a "badness factor" for running at a higher cpu frequency, which can be weighed against either powering up extra cores / cpus or having vcpus wait on the runqueue. Let's start with a basic "badness factor" and see if we can get it worked out properly, and then look at extending it to these sorts of things.
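To make the weight-vs-reservation arithmetic above concrete, here is a minimal C sketch of the "weighted share or reservation, whichever is greater" rule; the structures, names and iterative rebalance are hypothetical illustrations, not credit2 code:

/*
 * Toy model of the weight-vs-reservation split described above.
 * Not Xen code; a hypothetical sketch to illustrate the arithmetic.
 */
#include <stdio.h>

#define NR_VMS 3

struct vm {
    const char *name;
    double weight;      /* relative weight, e.g. 256                     */
    double reservation; /* guaranteed share, percent of one cpu; 0 = none */
    double share;       /* computed share, percent                        */
    int pinned;         /* share fixed at the reservation                 */
};

static void compute_shares(struct vm *vms, int n, double capacity)
{
    int changed;

    for (int i = 0; i < n; i++)
        vms[i].pinned = 0;

    do {
        double free_cap = capacity, wsum = 0.0;

        changed = 0;
        for (int i = 0; i < n; i++) {
            if (vms[i].pinned)
                free_cap -= vms[i].reservation;
            else
                wsum += vms[i].weight;
        }
        for (int i = 0; i < n; i++) {
            if (vms[i].pinned)
                continue;
            vms[i].share = free_cap * vms[i].weight / wsum;
            /* The reservation wins if the weighted share falls below it. */
            if (vms[i].share < vms[i].reservation) {
                vms[i].share = vms[i].reservation;
                vms[i].pinned = 1;
                changed = 1;
            }
        }
    } while (changed);
}

int main(void)
{
    /* The example from the mail: 1 core, equal weights, B reserves 40%. */
    struct vm vms[NR_VMS] = {
        { "A", 256, 0.0 },
        { "B", 256, 40.0 },
        { "C", 256, 0.0 },
    };

    compute_shares(vms, NR_VMS, 100.0);
    for (int i = 0; i < NR_VMS; i++)
        printf("%s: %.0f%%\n", vms[i].name, vms[i].share);
    return 0;
}

With the example inputs (equal weights, B reserving 40% of one core, all three cpu-bound), the rebalance converges to 30% / 40% / 30%, matching the numbers in the message above.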
George Dunlap
2009-Apr-15 15:07 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/10 Tian, Kevin <kevin.tian@intel.com>:
> How about the total number of VMs you'd like to support?

A rule-of-thumb number would be that we want to perform well at 4 VMs per core, and wouldn't mind having a performance "cliff" past 8 per core (not thread). So for a 16-core system, that would be "good" for 64 VMs and "acceptable" up to 128 VMs.

> Do you mean that the same elapsed time in the above two scenarios will be
> translated into different credits?

Yes. Ideally, we want to give "processing power" based on weight. But the "processing power" of a thread whose sibling is idle is significantly more than the "processing power" of a thread whose sibling is running. (Same thing possibly for cpu frequency scaling.) So we'd want to arrange the credits such that VMs with equal weight get equal "processing power", not just equal "time on a logical cpu".

> Xen 3.4 now supports "sched_smt_power_savings" (both a boot option
> and touchable by xenpm) to change the power/performance preference.
> It's a simple implementation that just reverses the span order from
> the existing package->core->thread to thread->core->package. More
> fine-grained flexibility could be given in future if a hierarchical scheduling
> concept could be more clearly constructed, like the domain scheduler
> in Linux.

I haven't looked at this code. From your description here it sounds like a sort of simple hack to get the effect we want (either spreading things out or pushing them together) -- is that correct?

My general feeling is that hacks are good short-term solutions, but not long-term. Things always get more complicated, and often have unexpected side-effects. I think since we're doing scheduler work, it's worth trying to see if we can actually solve the power/performance problem.

> imo, weight is not strictly translated into the care for latency. Any
> elaboration on that? I remember that previously Nishiguchi-san
> gave an idea to boost credit, and Disheng proposed static priority.
> Maybe you can make a summary to help people see how latency would
> be ensured in your proposal.

All of this needs to be run through experiments. So far, I've had really good success with putting waking VMs in "boost" priority for 1ms if they still have credits. (And unlike the credit scheduler, I try to make sure that a VM rarely runs out of credits.)

> There should be some way to adjust or limit usage of 'reservation' when
> multiple vcpus claim a desire which sums up to something exceeding the
> cpu's computing power, or which weakens your general
> 'weight-as-basic-unit' idea.

All "reservations" on the system must add up to less than the total processing power of the system. So a system with 2 cores can't have a sum of reservations of more than 200%. Xen will check this when setting the reservation and return an appropriate error message if necessary.

>> * We will also have an interface to the cpu-vs-electrical power.
>>
>> This is yet to be defined. At the hypervisor level, it will probably
>> be a number representing the "badness" of powering up extra cpus /
>> cores. At the tools level, there will probably be the option of
>> either specifying the number, or of using one of 2/3 pre-defined
>> values {power, balance, green/battery}.
>
> Not sure how that number will be defined. Maybe we can follow
> the current way and just add individual name-based options matching
> its purpose (such as migration_cost and sched_smt_power_savings...)

At the scheduler level, I was thinking along the lines of "core_power_up_cost".
This would be comparable to the cost of having things waiting on the runqueue. So (for example) if the cost were 0.1, then when the load on the current processors reached 1.1, the scheduler would power up another core. You could set it to 0.5 or 1.0 to save more power (at the cost of some performance). I think defining it that way is the closest to what you really want: a way to define the performance impact vs power consumption.

Obviously at the user interface level, we might have something more manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or something like that.

But as I said, the *goal* is to have a useful configurable interface; the implementation will depend on what actually can be made to work in practice.

 -George
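A hypothetical sketch of the "core_power_up_cost" knob and the {power, balance, green} presets described above; the names, the preset values and the load metric are illustrative assumptions, not Xen code:

#include <stdbool.h>
#include <stdio.h>

/* Tools-level presets mapped onto the scheduler-level number. */
enum power_pref { PREF_POWER, PREF_BALANCE, PREF_GREEN };

static double preset_to_cost(enum power_pref p)
{
    switch (p) {
    case PREF_POWER:   return 0.0;  /* pure performance          */
    case PREF_BALANCE: return 0.2;
    case PREF_GREEN:   return 0.8;  /* trade latency for power   */
    }
    return 0.0;
}

/*
 * 'load' is the average runnable-vcpu load per powered core
 * (1.0 == exactly one vcpu's worth of work per core, no queueing).
 */
static bool should_power_up_core(double load, int powered, int total,
                                 double core_power_up_cost)
{
    /*
     * Waking another core is worth it only once the queueing "badness"
     * on the powered cores exceeds the configured power-up "badness".
     */
    return powered < total && load > 1.0 + core_power_up_cost;
}

int main(void)
{
    double cost = preset_to_cost(PREF_BALANCE);

    printf("%d\n", should_power_up_core(1.1, 2, 8, cost));  /* 0: stay parked */
    printf("%d\n", should_power_up_core(1.3, 2, 8, cost));  /* 1: power up    */
    return 0;
}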
George Dunlap
2009-Apr-15 15:47 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> The normal name for this is Turbo Boost. However it'd be difficult
> for software to account for extra cycles gained from overclocking,
> as whether boost actually happens and how many cycles can
> be boosted are completely controlled by the hardware.

From the context, it sounded like Jeremy was saying that if we expose a whole socket to a guest, then the guest can try to schedule things either to take advantage of multiple cores or to take advantage of Turbo Boost. (I.e., punt the Turbo Boost performance optimization to the guest, just as we could punt the hyperthreading problem to the guest.)

In any case, even if we can't control it, we may be able to do some estimates (i.e., we expect this core to run at about 120%). There will probably be some performance counters that we could use to estimate how much "boost" a VM actually got and deal with credits accordingly... but that's yet another level of complication. I'll put it in the list of things to look at, and we'll see how far we get.

 -George
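Purely as an illustration of what such an estimate might feed into (the 120% figure, the per-mille interface and the function name are assumptions, not an existing Xen mechanism), credit burn could be scaled by the estimated effective clock:

#include <stdint.h>
#include <stdio.h>

/*
 * Charge for 'ns' nanoseconds of run time, scaled by the estimated
 * speed of the pcpu while the vcpu ran: 1000 == nominal frequency,
 * 1200 == running ~20% above nominal thanks to Turbo Boost,
 * 800  == throttled to 80% by frequency scaling.
 */
static int64_t credits_to_burn(int64_t ns, unsigned int est_speed_permille)
{
    return ns * est_speed_permille / 1000;
}

int main(void)
{
    /* 1ms of run time costs more on a boosted core, less on a throttled one. */
    printf("%lld\n", (long long)credits_to_burn(1000000, 1200));
    printf("%lld\n", (long long)credits_to_burn(1000000, 800));
    return 0;
}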
George Dunlap
2009-Apr-15 15:56 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> The major worry to me is the added complexity of exposing such sibling
> pairs to the guest. You then have to schedule at core level for that VM,
> since the implication of HT should always be maintained, or else the
> reverse effect could be seen when the VM does try to utilize that topology.
> This brings trouble to the scheduler. Not all VMs are guest SMP, and
> then the VM being exposed with HT is actually treated unfairly, as one
> more limitation is imposed: a partially idle core can't be used by it
> while other VMs are immune. Another tricky part is that you have to
> gang schedule that VM, which is in concept fancy but no one has
> come up with a solid implementation in practice.

I think gang scheduling with this limited scope (a hyper-pair to be scheduled on a hyper-pair) should be a lot easier than the general case. In any case, as long as we assign and deduct credits appropriately, a threaded VM shouldn't have a disadvantage compared to a single-threaded VM.

 -George
Jeremy Fitzhardinge
2009-Apr-15 16:23 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
George Dunlap wrote:
> On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> This can probably be extended to Intel's hyper-dynamic flux mode (that may
>> not be the real marketing name), where it can overclock one core if the
>> other is idle.
>
> Jeremy,
>
> Did you mean we could expose an entire socket to a guest VM, so that
> it could schedule so as to take advantage of the effects of Turbo
> Boost, just as we can expose thread pairs to a VM and let the guest OS
> scheduler deal with threading issues?

Yes, precisely. They're the same in that Xen concurrently schedules two (or more?) vcpus to the guest which have interdependent performance.

One could imagine a case where a guest with a single-threaded workload gets best performance by being given a thread/core pair, running its work on one while explicitly keeping the other idle. Of course that idle core is lost to the rest of the system in the meantime, so the guest should get charged for both.

And some kind of small-scale gang scheduling might be useful for small SMP guests anyway, because their spinlocks and IPIs will work as expected, and they'll presumably get shared cache at some level.

    J
Tian, Kevin
2009-Apr-16 04:58 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 15 April 2009 23:07
>
>> Do you mean that the same elapsed time in the above two scenarios will be
>> translated into different credits?
>
> Yes. Ideally, we want to give "processing power" based on weight.
> But the "processing power" of a thread whose sibling is idle is
> significantly more than the "processing power" of a thread whose
> sibling is running. (Same thing possibly for cpu frequency scaling.)
> So we'd want to arrange the credits such that VMs with equal weight
> get equal "processing power", not just equal "time on a logical cpu".

Yup, this is one interesting part to be explored further.

>> Xen 3.4 now supports "sched_smt_power_savings" (both a boot option
>> and touchable by xenpm) to change the power/performance preference.
>> It's a simple implementation that just reverses the span order from
>> the existing package->core->thread to thread->core->package. More
>> fine-grained flexibility could be given in future if a hierarchical
>> scheduling concept could be more clearly constructed, like the domain
>> scheduler in Linux.
>
> I haven't looked at this code. From your description here it sounds
> like a sort of simple hack to get the effect we want (either
> spreading things out or pushing them together) -- is that correct?

Yes: spread first vs. fill first.

> My general feeling is that hacks are good short-term solutions, but
> not long-term. Things always get more complicated, and often have
> unexpected side-effects. I think since we're doing scheduler work,
> it's worth trying to see if we can actually solve the
> power/performance problem.

Agreed. Have you looked at the Linux domain scheduler idea? I'm not sure whether that topology-based multi-level scheduler would help here or over-complicate things.

>> imo, weight is not strictly translated into the care for latency. Any
>> elaboration on that? I remember that previously Nishiguchi-san
>> gave an idea to boost credit, and Disheng proposed static priority.
>> Maybe you can make a summary to help people see how latency would
>> be ensured in your proposal.
>
> All of this needs to be run through experiments. So far, I've had
> really good success with putting waking VMs in "boost" priority for
> 1ms if they still have credits. (And unlike the credit scheduler, I
> try to make sure that a VM rarely runs out of credits.)

By the way, accurate accounting (at context switch, instead of the current tick-based accounting) should also be incorporated, if you do want to manipulate credits at a fine granularity.

>> There should be some way to adjust or limit usage of 'reservation' when
>> multiple vcpus claim a desire which sums up to something exceeding the
>> cpu's computing power, or which weakens your general
>> 'weight-as-basic-unit' idea.
>
> All "reservations" on the system must add up to less than the total
> processing power of the system. So a system with 2 cores can't have a
> sum of reservations of more than 200%. Xen will check this when setting
> the reservation and return an appropriate error message if necessary.

Return an error, or scale previous successful reservations down?

>>> * We will also have an interface to the cpu-vs-electrical power.
>>>
>>> This is yet to be defined. At the hypervisor level, it will probably
>>> be a number representing the "badness" of powering up extra cpus /
>>> cores. At the tools level, there will probably be the option of
>>> either specifying the number, or of using one of 2/3 pre-defined
>>> values {power, balance, green/battery}.
>>
>> Not sure how that number will be defined.
>> Maybe we can follow
>> the current way and just add individual name-based options matching
>> its purpose (such as migration_cost and sched_smt_power_savings...)
>
> At the scheduler level, I was thinking along the lines of
> "core_power_up_cost". This would be comparable to the cost of having
> things waiting on the runqueue. So (for example) if the cost were 0.1,

Who decides what the cost should be? How is it easily useful to an end customer?

> then when the load on the current processors reached 1.1, the
> scheduler would power up another core. You could set it to 0.5 or 1.0 to save

What do you mean by 'power up'? Boost its frequency, or migrate load to that core?

> more power (at the cost of some performance). I think defining it
> that way is the closest to what you really want: a way to define the
> performance impact vs power consumption.

I'm still a bit confused here. What (in which situation) is translated into a value comparable to "core_power_up_cost"?

> Obviously at the user interface level, we might have something more
> manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or
> something like that.

Then how is this triple mapped to the "core_power_up_cost" above?

> But as I said, the *goal* is to have a useful configurable interface;
> the implementation will depend on what actually can be made to work in
> practice.

I agree with this goal, but I'm not convinced by the above example. :-)

Thanks
Kevin
Tian, Kevin
2009-Apr-16 05:11 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 15 April 2009 23:56
>
> 2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
>> The major worry to me is the added complexity of exposing such sibling
>> pairs to the guest. You then have to schedule at core level for that VM,
>> since the implication of HT should always be maintained, or else the
>> reverse effect could be seen when the VM does try to utilize that topology.
>> This brings trouble to the scheduler. Not all VMs are guest SMP, and
>> then the VM being exposed with HT is actually treated unfairly, as one
>> more limitation is imposed: a partially idle core can't be used by it
>> while other VMs are immune. Another tricky part is that you have to
>> gang schedule that VM, which is in concept fancy but no one has
>> come up with a solid implementation in practice.
>
> I think gang scheduling with this limited scope (a hyper-pair to be
> scheduled on a hyper-pair) should be a lot easier than the general
> case. In any case, as long as we assign and deduct credits
> appropriately, a threaded VM shouldn't have a disadvantage compared to
> a single-threaded VM.

Could you elaborate more on what is being simplified compared to generic gang scheduling? I used to be scared by the complexity of having multiple vcpus sync in and sync out, especially alongside other heterogeneous VMs (without a gang scheduling requirement). It's possibly simpler if all VMs in the system are hyper-pair based... :-)

Thanks
Kevin
George Dunlap
2009-Apr-16 10:27 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
> Could you elaborate more on what is being simplified compared to
> generic gang scheduling? I used to be scared by the complexity of having
> multiple vcpus sync in and sync out, especially alongside other
> heterogeneous VMs (without a gang scheduling requirement). It's possibly
> simpler if all VMs in the system are hyper-pair based... :-)

(I've only done some thought experiments re gang scheduling, so take that into account as you read this description.)

The general gang scheduling problem is that you have P processors and N VMs, each of which has up to M vcpus. Each VM may have a different number of vcpus N.vcpu < P, and at any one time there may be a different number of runnable vcpus N.runnable < N.vcpu < P; this may change at any time. The general problem is to maximize the number of used processors P (and thus maximize throughput). If you have a VM with 4 vcpus, but it's only using 2, you have a choice: do I run it on 2 cores, and let another VM use the other 2 cores? Or do I "reserve" 4 cores for it, so that if the other 2 vcpus wake up they can run immediately? If you do the former, then if one of the other vcpus wakes up you have to quickly preempt someone else; if not, you risk leaving the two cores idle for the entire timeslice, effectively throwing away the processing power. The whole problem is likely to be NP-complete, and really obnoxious to have good heuristics for.

In the case of HT gang-scheduling, the problem is significantly constrained:
* The pairs of processors to look at are constrained: each logical cpu has a pre-defined sibling.
* The quantity we're trying to gang-schedule is significantly constrained: only 2; and each gang-scheduled vcpu has its pre-defined HT sibling as well.
* If only one sibling of an HT pair is active, the other one isn't wasted; the active thread will get more processing power. So we don't have that risky choice.

So there are really only a handful of new cases to consider:
* An HT vcpu pair wakes up or comes to the front of the queue when another HT vcpu pair is running.
  + Simple: order normally. If this pair is a higher priority (whatever that means) than the running pair, preempt the running pair (i.e., preempt the vcpus on both logical cpus).
* A non-HT vcpu becomes runnable (or comes to the front of the runqueue) when an HT vcpu pair is on a pair of threads.
  + If the non-HT vcpu priority is lower, wait until the HT vcpu pair is finished.
  + If the non-HT vcpu priority is higher, we need to decide whether to wait longer or whether to preempt both threads. This may depend on whether there are other non-HT vcpus waiting to run, and what their priority is.
* An HT vcpu pair becomes runnable when non-HT vcpus are running on the threads.
  + Deciding whether to wait or preempt both threads will depend on the relative weights of both.

These decisions are a lot easier to deal with than the full gang-scheduling problem (AFAICT). One can imagine, for instance, that each HT pair could share one runqueue. A vcpu HT pair would be put on the runqueue as an individual entity. When it reached the top of the queue, both threads would preempt what was on them and run the pair of threads (or idle if the vcpu was idle). Otherwise, each thread would take a single non-pair vcpu and execute it.

At any rate, that's a step further down the road. First we need to address basic credits, latency, and load-balancing. (Implementation e-mails to come in a bit.)
 -George
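A rough sketch of the shared per-HT-pair runqueue idea described in the message above. The types and helper names are invented for illustration; this is not the credit2 implementation, and it glosses over how the preempted sibling picks up its half of the pair:

#include <stdbool.h>
#include <stddef.h>

struct vcpu;                   /* opaque here; stands in for Xen's struct vcpu */

struct sched_item {
    bool is_pair;              /* true for a gang-scheduled hyperthread pair   */
    struct vcpu *v[2];         /* v[1] is NULL for a single (non-pair) vcpu    */
    struct sched_item *next;   /* priority-ordered runqueue link               */
};

struct ht_pair_runqueue {
    struct sched_item *head;   /* one queue shared by both sibling threads     */
};

/*
 * Called when one thread of the sibling pair needs to pick new work.
 * 'thread' is 0 or 1, identifying which sibling is asking.
 */
static void schedule_thread(struct ht_pair_runqueue *rq, int thread,
                            struct vcpu **run_here, bool *kick_sibling)
{
    struct sched_item *it = rq->head;

    *kick_sibling = false;

    if (it == NULL) {
        *run_here = NULL;      /* nothing runnable: idle this thread */
        return;
    }

    rq->head = it->next;

    if (it->is_pair) {
        /*
         * A pair at the head of the queue claims both siblings: run one
         * half here and preempt the peer thread so it runs the other.
         * (A real implementation would hand the sibling its half
         * directly rather than have it re-walk the queue.)
         */
        *run_here = it->v[thread];
        *kick_sibling = true;
    } else {
        /* Otherwise each sibling just takes the next single vcpu. */
        *run_here = it->v[0];
    }
}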
Dan Magenheimer
2009-Apr-16 14:10 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
From a resource utilization perspective, hyper-pairing may make sense. But what about the user perspective? How would an administrator specify hyper-pairing? And more importantly, why? When consolidating workloads from, say, a group of dual-core or dual-processor servers onto some future larger hyperthreaded server, why would anyone say "please assign this to a hyper-pair", which is essentially saying "give me less peak performance than I had before"?

Also, in George's analysis, the problem is greatly simplified because today's (x86) processors are limited to two hyperthreads. How soon will we see more threads per core, given that other non-x86 CPUs already support four or more?
Jeremy Fitzhardinge
2009-Apr-16 16:32 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense. But what about the user perspective? How would
> an administrator specify hyper-pairing? And more importantly,
> why? When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?

I don't see how it makes a difference. At the moment, you're never sure if a pair of vcpus are HT thread pairs, two cores on the same socket, or on completely different sockets - all of which will have quite different performance characteristics. And unless your server is under-committed, you're always running the risk that one domain is competing with another for CPU when it needs it most - and if you're under-committed, you can always pin everything in exactly the config you want.

Besides, the chances are good that the single-threaded performance of each core on your shiny new server will be fast enough to overcome the cost of HT compared to your old server...

> Also, in George's analysis, the problem is greatly
> simplified because today's (x86) processors are limited
> to two hyperthreads. How soon will we see more threads
> per core, given that other non-x86 CPUs already support
> four or more?

I think the simplifying factor is that the number of threads/cores you're ganging together is a relatively small proportion of the total number of available threads/cores, so the problem is under-constrained and there are lots of nearly-optimal solutions. If you're trying to gang schedule a large proportion of your total resources, then you get into tricky box-packing territory.

    J
Andrew Lyon
2009-Apr-16 18:20 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 16, 2009 at 5:32 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Dan Magenheimer wrote:
>> From a resource utilization perspective, hyper-pairing may
>> make sense. But what about the user perspective? How would
>> an administrator specify hyper-pairing? And more importantly,
>> why? When consolidating workloads from, say, a group
>> of dual-core or dual-processor servers onto some future
>> larger hyperthreaded server, why would anyone say
>> "please assign this to a hyper-pair", which is essentially
>> saying "give me less peak performance than I had before"?
>
> I don't see how it makes a difference. At the moment, you're never sure if
> a pair of vcpus are HT thread pairs, two cores on the same socket, or on
> completely different sockets - all of which will have quite different
> performance characteristics. And unless your server is under-committed,
> you're always running the risk that one domain is competing with another for
> CPU when it needs it most - and if you're under-committed, you can always
> pin everything in exactly the config you want.
>
> Besides, the chances are good that the single-threaded performance of each
> core on your shiny new server will be fast enough to overcome the cost of HT
> compared to your old server...

Is HT particularly worthwhile for virtualization loads? We have several older servers which have HT, and I found that when running Windows Terminal Services it actually slowed the machine down; under certain circumstances it seemed to cause the system to become extremely slow and it had to be rebooted. We disabled HT and the problem went away.

Many documents on the web recommend disabling HT for specific workloads, and most benchmarks show that when it is beneficial the performance gain is quite small.

Or is your plan to make use of HT in a way that gets the most benefit with no impact in edge cases?

Andy
Jeremy Fitzhardinge
2009-Apr-16 18:28 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Andrew Lyon wrote:
> Is HT particularly worthwhile for virtualization loads? We have
> several older servers which have HT, and I found that when running
> Windows Terminal Services it actually slowed the machine down; under
> certain circumstances it seemed to cause the system to become
> extremely slow and it had to be rebooted. We disabled HT and the
> problem went away.
>
> Many documents on the web recommend disabling HT for specific
> workloads, and most benchmarks show that when it is beneficial the
> performance gain is quite small.
>
> Or is your plan to make use of HT in a way that gets the most benefit
> with no impact in edge cases?

"New" HT, as reintroduced in i7, is supposed to be "much better".

    J
Tian, Kevin
2009-Apr-17 10:02 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 16 April 2009 18:28
>
> 2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
>> Could you elaborate more on what is being simplified compared to
>> generic gang scheduling? I used to be scared by the complexity of having
>> multiple vcpus sync in and sync out, especially alongside other
>> heterogeneous VMs (without a gang scheduling requirement). It's possibly
>> simpler if all VMs in the system are hyper-pair based... :-)
>
> (I've only done some thought experiments re gang scheduling, so take
> that into account as you read this description.)

Thanks for writing this down.

> The general gang scheduling problem is that you have P processors and
> N VMs, each of which has up to M vcpus. Each VM may have a
> different number of vcpus N.vcpu < P, and at any one time there
> may be a different number of runnable vcpus N.runnable < N.vcpu < P;
> this may change at any time. The general problem is to maximize
> the number of used processors P (and thus maximize throughput).
> If you have a VM with 4 vcpus, but it's only using 2, you have a
> choice: do I run it on 2 cores, and let another VM use the other 2
> cores? Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
> wake up they can run immediately? If you do the former, then if one
> of the other vcpus wakes up you have to quickly preempt someone else;
> if not, you risk leaving the two cores idle for the entire timeslice,
> effectively throwing away the processing power. The whole problem is
> likely to be NP-complete, and really obnoxious to have good heuristics
> for.

This is related to the definition of gang scheduling. In my memory, gang scheduling comes from parallel computing requirements, where massive inter-thread communication/synchronization exists and one thread blocking also impacts the rest. That normally requires a 'gang' concept: once one thread is to be scheduled, all other threads in the same group are scheduled together; on the other hand, context switches are minimized to keep threads always in the ready state. This idea is very application specific and thus doesn't normally find its way into the general market.

Based on your whole write-up, you don't consider how to ensure that the vcpus of one VM are scheduled in sync at a given time point. Instead, your model is to ensure exclusive usage of a thread pair within a given quantum window. It's not really 'gang scheduling', but it's OK for us to call it 'simplified gang scheduling' in this context, since the key point in this discussion is whether we can do some optimization for HT, not gang scheduling itself. :-)

> In the case of HT gang-scheduling, the problem is
> significantly constrained:
> * The pairs of processors to look at are constrained: each logical cpu
> has a pre-defined sibling.
> * The quantity we're trying to gang-schedule is significantly
> constrained: only 2; and each gang-scheduled vcpu has its pre-defined
> HT sibling as well.

Then I assume only VMs with <= 2 vcpus are considered here.

> * If only one sibling of an HT pair is active, the other one isn't
> wasted; the active thread will get more processing power. So we don't
> have that risky choice.

Maybe, maybe not. Bear in mind that HT was introduced because one thread normally includes lots of stall cycles, and those stalled cycles can be used to drive another thread. Whether more processing power is gained is very workload specific.

> So there are really only a handful of new cases to consider:
> * An HT vcpu pair wakes up or comes to the front of the queue when
> another HT vcpu pair is running.
>   + Simple: order normally. If this pair is a higher priority
> (whatever that means) than the running pair, preempt the running pair
> (i.e., preempt the vcpus on both logical cpus).

"vcpu pair"? Your descriptions below are all based on vcpu pairs. How do you define the status (blocked, runnable, running) of a pair? If it's the AND of the individual statuses of the two vcpus, it's too restrictive to reach a runnable status. If it's an OR operation, you may then need to consider how to define a sane priority (is one-vcpu-ready-the-other-blocked with higher credits considered higher priority than two-vcpus-ready with lower credits? etc.)

> * A non-HT vcpu becomes runnable (or comes to the front of the
> runqueue) when an HT vcpu pair is on a pair of threads.
>   + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
> is finished.
>   + If the non-HT vcpu priority is higher, we need to decide whether to
> wait longer or whether to preempt both threads. This may depend on
> whether there are other non-HT vcpus waiting to run, and what their
> priority is.
> * An HT vcpu pair becomes runnable when non-HT vcpus are
> running on the threads.
>   + Deciding whether to wait or preempt both threads will depend on the
> relative weights of both.

All of the above looks like it adds a lot of complexity to a common single-vcpu-based scheduler.

> These decisions are a lot easier to deal with than the full
> gang-scheduling problem (AFAICT). One can imagine, for instance, that
> each HT pair could share one runqueue. A vcpu HT pair would be put on
> the runqueue as an individual entity. When it reached the top of the
> queue, both threads would preempt what was on them and run the pair of
> threads (or idle if the vcpu was idle). Otherwise, each thread would
> take a single non-pair vcpu and execute it.

That may bring undesired effects. The more cross-cpu talk you add to the scheduler, the less efficient the scheduler may be, due to locks on both runqueues.

> At any rate, that's a step further down the road. First we need to
> address basic credits, latency, and load-balancing. (Implementation
> e-mails to come in a bit.)

Yup. The first implementation should take generality, simplicity and overall efficiency as the major goals.

Lastly, any effort toward better HT efficiency is welcome to me. However, we need to be careful to avoid introducing unnecessary complexity into the scheduler. I've learned that the more complex (or flexible) the scheduler is, the more unexpected side-effects may occur in other areas. So far I'm not convinced that the above idea leads to a fairer system, either for HT vcpu pairs or for non-HT vcpus. I agree with Ian's idea that topology is exposed in the partitioning case, where the scheduler doesn't require changes, and that for the general case we just need HT accounting to ensure fairness, which is a simple enhancement. :-)

Thanks,
Kevin
George Dunlap
2009-Apr-17 10:17 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 16, 2009 at 3:10 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense. But what about the user perspective? How would
> an administrator specify hyper-pairing? And more importantly,
> why? When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?

I think what you're saying is that when we only expose vcpus to the guest, we can either run 2 vcpus on HT pairs, or give them an entire core to themselves; but if we expose them as HT pairs and gang schedule, then we're promising only to run them on HT pairs, limiting the peak performance.

Hmm, I'm not sure that's actually true. We could, if we had a particularly idle system, split HT pairs and let them run as independent vcpus. I'm pretty sure the resulting throughput would usually be higher.

In any case, it's a bit like asking, "Why would I buy a machine with two hyperthreads instead of two cores?" Yes, going from 2 vcpus to 2 vhts (virtual hyperthreads) is a step down in computing power; so would going from a dual-core processor w/o HT to a single-core processor with HT. If you want to monotonically increase power, give it 4 vhts.

At any rate, I think we can bring these up again when we actually start to implement this feature. First things first. :-)

 -George
Dan Magenheimer
2009-Apr-17 14:13 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> In any case, it's a bit like asking, "Why would I buy
> a machine with two hyperthreads instead of two cores?"

Yes. In a physical machine, the OS takes advantage of all resources available. So it doesn't matter if some of the "processors" are cores and some are hyperthreads. You are using ALL of the CPU resources you paid for.

But in a virtualized environment, each VM gets a fraction of the resources, and if grabbing some fixed number of "processors" sometimes gets hyperthreads and sometimes gets cores, this will cause interesting issues for some workloads.

Think about a cloud where one pays for resources used. You likely would demand to pay less for a hyperpair than for a non-vht pair.

As a result, I think it will be a requirement that a system administrator be able to specify "I want two FULL cores" vs "I am willing to accept two hyperthreads". And once you get beyond hyperpairs, this is going to get very messy.

> At any rate, I think we can bring these up again when
> we actually start to implement this feature. First
> things first. :-)

Well, yes and no. While I am a big fan of iterative prototyping, I wonder if this might be a case where the architecture should drive the design and the design should drive the implementation. IOW, first think through what choices a system admin should be able to make (and, if given a mind-boggling array of choices, which ones he/she WOULD make).

Just my two cents...
Jeremy Fitzhardinge
2009-Apr-17 14:55 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
>> In any case, it's a bit like asking, "Why would I buy
>> a machine with two hyperthreads instead of two cores?"
>
> Yes. In a physical machine, the OS takes advantage of all
> resources available. So it doesn't matter if some of the
> "processors" are cores and some are hyperthreads. You
> are using ALL of the CPU resources you paid for.
>
> But in a virtualized environment, each VM gets a fraction
> of the resources, and if grabbing some fixed number of
> "processors" sometimes gets hyperthreads and sometimes
> gets cores, this will cause interesting issues for some
> workloads.
>
> Think about a cloud where one pays for resources used.
> You likely would demand to pay less for a hyperpair than
> for a non-vht pair.
>
> As a result, I think it will be a requirement that
> a system administrator be able to specify "I want two
> FULL cores" vs "I am willing to accept two hyperthreads".
> And once you get beyond hyperpairs, this is going to
> get very messy.

I think you're over-complicating it. At very worst, it will be no worse than the current situation, where Xen will place the vcpus on threads/cores in more or less arbitrary ways.

I think George's proposal can already accommodate the user needs you're talking about:

If the scheduler accounts for time spent executing on a contended HT thread (i.e., the threads are not paired, so the other thread could be idle or running any other code) at a lesser rate than a full core / uncontended thread, then the charging works out.

If the user has a requirement that domain X's vcpus must be running at full speed, then they can set their reservation to 100%. If we say that a contended HT thread is only worth, say, 70% of a "real" core, then that not only factors into the charging, but also means that any domain with a reservation > 70% is ineligible to run on a contended HT thread. (I think in practice this means that any domain with high reservations will end up running on gang-scheduled thread pairs, just to guarantee that the other thread is idle, so the uncontended HT thread can run at 100%.)

(Another way to look at it is that HT contention is a bit like having your vcpu preempted by Xen, but rather than going from 100% running to 0% running, your vcpu drops to 70%.)

    J
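A small sketch of the discounting described above, assuming (as in the message) that a contended hyperthread is worth roughly 70% of a full core; the constants and helper names are illustrative only, not part of any scheduler:

#include <stdbool.h>
#include <stdint.h>

/* A contended HT thread is assumed to deliver ~70% of a full core. */
#define HT_CONTENDED_PERMILLE 700

/* Credits to charge for 'ns' nanoseconds of run time on a given slot. */
static int64_t charge_for_runtime(int64_t ns, bool sibling_busy)
{
    return sibling_busy ? ns * HT_CONTENDED_PERMILLE / 1000 : ns;
}

/* Can this domain's reservation be met on a contended thread at all? */
static bool eligible_for_contended_thread(unsigned int reservation_permille)
{
    return reservation_permille <= HT_CONTENDED_PERMILLE;
}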
Dan Magenheimer
2009-Apr-17 15:55 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> I think you're over-complicating it.

Perhaps. Or maybe you are oversimplifying it? ;-)

> At very worst, it will be no worse
> than the current situation, where Xen will place the vcpus on
> threads/cores in more or less arbitrary ways.

Agreed. Treating threads as cores is bad. Since that's what's happening today, one would think that any fix is better than nothing.

> a contended HT thread is only worth, say, 70% of a "real" core
> :
> (Another way to look at it is that HT contention is a bit like having
> your vcpu preempted by Xen, but rather than going from 100%
> running to 0% running, your vcpu drops to 70%.)

And that's the oversimplification, I think. Just because Intel provides a rule-of-thumb that the extra thread increases performance by 30% doesn't mean that it is a good number to choose for scheduling purposes.

I suspect (and maybe this has even already been proven) that this varies from 0%-100% depending on the workload, and may even vary from *negative* to *more* than 100%. (Yes, I understand that i7 is supposed to be better than the last round of HT... but is it always better?)

Dan
Jeremy Fitzhardinge
2009-Apr-17 16:17 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
> And that's the oversimplification, I think. Just
> because Intel provides a rule-of-thumb that the extra
> thread increases performance by 30% doesn't mean that
> it is a good number to choose for scheduling purposes.

Actually the 70% was a number I plucked out of the air with no justification at all.

> I suspect (and maybe this has even already been proven)
> that this varies from 0%-100% depending on the workload,
> and may even vary from *negative* to *more* than 100%.
> (Yes, I understand that i7 is supposed to be better than
> the last round of HT... but is it always better?)

The only way to know is by measurement, ideally with some specific performance counter which tells you what went on in that last timeslice. But if this is a big issue, you can always disable HT, as lots of people did the last time around.

    J
Dan Magenheimer
2009-Apr-17 16:46 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> But if this is a big issue, you can always disable HT, as
> lots of people did the last time around.

That would be a shame, because HT will almost certainly provide SOME performance benefit MOST of the time.

After pondering a bit, I guess I am arguing that once processors have HT, Turbo Boost, and power management, scheduling as a discipline has to move from the realm of the discrete to the realm of the continuous. A "second of CPU" no longer has any real meaning when the value of "a CPU" varies across time and workload. (I suppose due to shared cache effects and bus contention, this has probably always been the case, but to a less obvious degree.)

> The only way to know is by measurement, ideally with
> some specific performance counter which tells you
> what went on in that last timeslice.

Indeed. Even if it is impossible to predict the throughput of a specific workload on a specific CPU, it sure would be nice if we could at least roughly measure the past. Processor architects take note! ;-)
George Dunlap
2009-Apr-17 17:05 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Jeremy Fitzhardinge wrote:
> The only way to know is by measurement, ideally with some specific
> performance counter which tells you what went on in that last
> timeslice. But if this is a big issue, you can always disable HT, as
> lots of people did the last time around.

I think measurement, both of total system throughput and individual VM throughput, is the final word on all designs. I certainly plan on testing and comparing throughput for a variety of workloads as I develop the scheduler. And I encourage anyone with the time and inclination to try to find workloads for which the scheduler performs poorly as new features (such as the proposed HT scheduling) are introduced. :-)

Peace,
 -George