George Dunlap
2009-Apr-09 15:58 UTC
[Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
In the interest of openness (as well as in the interest of taking advantage of all the smart people out there), I'm posting a very early design prototype of the credit2 scheduler. We've had a lot of contributors to the scheduler recently, so I hope that those with interest and knowledge will take a look and let me know what they think at a high level.

This first e-mail will discuss the overall goals: the target "sweet spot" use cases to consider, measurable goals for the scheduler, and the target interface / features. This is for general comment.

The subsequent e-mail(s?) will include some specific algorithms and changes currently in consideration, as well as some bleeding-edge patches. This will be for people who have a specific interest in the details of the scheduling algorithms.

Please feel free to comment / discuss / suggest improvements.

1. Design targets

We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).

For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.

For client systems, we expect to have 3-4 VMs (including dom0). Systems will probably have a single socket with 2 cores and SMT (4 logical cpus). Many VMs will be using PCI pass-through to access network, video, and audio cards. They'll also be running video and audio workloads, which are extremely latency-sensitive.

2. Design goals

For each of the target systems and workloads above, we have some high-level goals for the scheduler:

* Fairness. In this context, we define "fairness" as the ability to get cpu time proportional to weight.

We want to try to make this true even for latency-sensitive workloads such as networking, where long scheduling latency can reduce the throughput, and thus the total amount of time the VM can effectively use.

* Good scheduling for latency-sensitive workloads.

To the degree we are able, we want this to be true even for those which use a significant amount of cpu power: That is, my audio shouldn't break up if I start a cpu hog process in the VM playing the audio.

* HT-aware.

Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

* Power-aware.

Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

3. Target interface:

The target interface will be similar to credit1:

* The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).

* Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.

For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

* The "cap" functionality of credit1 will be retained.

This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.

* We will also have an interface to the cpu-vs-electrical power.

This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.
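To make the proposed interface a little more concrete, here is a minimal sketch in C of what the per-VM knobs might look like and how the weight example above works out. The struct and function names are purely illustrative assumptions, not an actual Xen interface:

    #include <stdio.h>

    /* Hypothetical per-VM scheduling parameters (names are assumptions). */
    struct sched_params {
        int weight;       /* proportional share, e.g. default 256 */
        int reservation;  /* minimum % of one cpu, 0 = none       */
        int cap;          /* maximum % of one cpu, 0 = uncapped   */
    };

    /* Weight-proportional split of one cpu between two competing cpu-hog
     * VMs, with the cap applied afterwards. */
    static double share_of_cpu(const struct sched_params *me,
                               const struct sched_params *other)
    {
        double s = 100.0 * me->weight / (me->weight + other->weight);
        if (me->cap && s > me->cap)
            s = me->cap;
        return s;
    }

    int main(void)
    {
        struct sched_params a = { 256, 0, 0 }, b = { 512, 0, 0 };
        /* Prints roughly 33% and 67%, matching the example above. */
        printf("A: %.0f%%  B: %.0f%%\n", share_of_cpu(&a, &b), share_of_cpu(&b, &a));
        return 0;
    }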
Jeremy Fitzhardinge
2009-Apr-09 18:41 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
George Dunlap wrote:

> 1. Design targets
>
> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>
> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?

> For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.
>
> For client systems, we expect to have 3-4 VMs (including dom0). Systems will probably have a single socket with 2 cores and SMT (4 logical cpus). Many VMs will be using PCI pass-through to access network, video, and audio cards. They'll also be running video and audio workloads, which are extremely latency-sensitive.
>
> 2. Design goals
>
> For each of the target systems and workloads above, we have some high-level goals for the scheduler:
>
> * Fairness. In this context, we define "fairness" as the ability to get cpu time proportional to weight.
>
> We want to try to make this true even for latency-sensitive workloads such as networking, where long scheduling latency can reduce the throughput, and thus the total amount of time the VM can effectively use.
>
> * Good scheduling for latency-sensitive workloads.
>
> To the degree we are able, we want this to be true even for those which use a significant amount of cpu power: That is, my audio shouldn't break up if I start a cpu hog process in the VM playing the audio.
>
> * HT-aware.
>
> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

Would it be worth just pair-scheduling HT threads so they're always running in the same domain?

> * Power-aware.
>
> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?

Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).

> 3. Target interface:
>
> The target interface will be similar to credit1:
>
> * The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).
>
> * Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.
>
> For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

How does the reservation interact with the credits? Is the reservation in addition to its credits, or does using the reservation consume them?

> * The "cap" functionality of credit1 will be retained.
>
> This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.
>
> * We will also have an interface to the cpu-vs-electrical power.
>
> This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.

Is it worth taking into account the power cost of cache misses vs hits?

Do vcpus running on pcpus running at less than 100% speed consume fewer credits?

Is there any explicit interface to cpu power state management, or would that be decoupled?

J
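To make the "directed schedule" idea slightly more concrete, here is a toy sketch of a yield-to primitive; everything in it (types, fields, the function itself) is a made-up illustration, not an existing Xen interface:

    /* Toy model of a directed yield: when the current vcpu knows it is about
     * to block waiting on another vcpu, hand the physical cpu straight to
     * that vcpu so it runs on the still-hot cache instead of being woken on
     * a remote pcpu.  All names here are hypothetical. */

    struct toy_vcpu {
        int runnable;
        struct toy_vcpu *runq_next;   /* simplified local run queue link */
    };

    struct toy_pcpu {
        struct toy_vcpu *current;
        struct toy_vcpu *runq;        /* head of the local run queue     */
    };

    static void yield_to(struct toy_pcpu *p, struct toy_vcpu *target)
    {
        if (target && target->runnable)
            p->current = target;      /* run the waited-on vcpu here, now */
        else
            p->current = p->runq;     /* otherwise fall back to the runq  */
    }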
Tian, Kevin
2009-Apr-10 00:15 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: George Dunlap
>Sent: 9 April 2009 23:59
>
>For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).

How were 80%/800% chosen here?

>For virtual desktop systems, we will have a large number of interactive VMs with a lot of shared memory. Most of these will be single-vcpu, or at most 2 vcpus.

What total number of VMs would you like to support?

>* HT-aware.
>
>Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".

Do you mean that the same elapsed time in the above two scenarios will be translated into different credits?

>* Power-aware.
>
>Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.

Xen 3.4 now supports "sched_smt_power_savings" (both a boot option and touchable by xenpm) to change the power/performance preference. It's a simple implementation which just reverses the span order from the existing package->core->thread to thread->core->package. More fine-grained flexibility could be given in the future if a hierarchical scheduling concept were constructed more clearly, like the domain scheduler in Linux.

Another possible 'fairness' point affected by power management could be to take frequency scaling into consideration, since credit so far is simply calculated from elapsed time, while the same elapsed time at different frequencies actually represents a different number of consumed cycles.

>3. Target interface:
>
>The target interface will be similar to credit1:
>
>* The basic unit is the VM "weight". When competing for cpu resources, VMs will get a share of the resources proportional to their weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will get 33% and 67% of the cpu, respectively).

IMO, weight does not translate strictly into care for latency; any elaboration on that? I remember that Nishiguchi-san previously gave an idea to boost credit, and Disheng proposed static priority. Maybe you can make a summary to help people see how latency would be ensured in your proposal.

>* Additionally, we will be introducing a "reservation" or "floor". (I'm open to name changes on this one.) This will be a minimum amount of cpu time that a VM can get if it wants it.

This is a good idea.

>For example, one could give dom0 a "reservation" of 50%, but leave the weight at 256. No matter how many other VMs run with a weight of 256, dom0 will be guaranteed to get 50% of one cpu if it wants it.

Should there be some way to adjust or limit use of the 'reservation' when multiple vcpus claim reservations which together exceed the cpu's computing power, or which weaken your general 'weight-as-basic-unit' idea?

>* The "cap" functionality of credit1 will be retained.
>
>This is a maximum amount of cpu time that a VM can get: i.e., a VM with a cap of 50% will only get half of one cpu, even if the rest of the system is completely idle.
>
>* We will also have an interface to the cpu-vs-electrical power.
>
>This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.

I'm not sure how that number will be defined. Maybe we can follow the current way and just add individual name-based options matching the purpose (such as migration_cost and sched_smt_power_savings...).

Thanks,
Kevin
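As a rough illustration of that frequency-scaling point (the credit code today burns credit purely on elapsed time; the names and units below are assumptions), the charge could be scaled by the delivered frequency:

    #include <stdint.h>

    struct vcpu_acct {
        int64_t credit;    /* remaining credit, arbitrary units */
    };

    /* elapsed_ns: time the vcpu just ran; cur_khz/max_khz: current and
     * nominal P-state frequency of the pcpu it ran on.  A vcpu run at half
     * clock is charged half as much, approximating "consumed cycles". */
    static void burn_credit_scaled(struct vcpu_acct *v, int64_t elapsed_ns,
                                   uint32_t cur_khz, uint32_t max_khz)
    {
        v->credit -= elapsed_ns * cur_khz / max_khz;
    }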
Tian, Kevin
2009-Apr-10 00:33 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge
>Sent: 10 April 2009 2:42
>
>George Dunlap wrote:
>> 1. Design targets
>>
>> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>>
>> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).
>
>Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?

Good point.

>> * HT-aware.
>>
>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>
>Would it be worth just pair-scheduling HT threads so they're always running in the same domain?

Running the same domain doesn't help fairness; instead, it makes it worse.

>> * Power-aware.
>>
>> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.
>
>I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?

This is really desired.

>Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).

The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)

>> * We will also have an interface to the cpu-vs-electrical power.
>>
>> This is yet to be defined. At the hypervisor level, it will probably be a number representing the "badness" of powering up extra cpus / cores. At the tools level, there will probably be the option of either specifying the number, or of using one of 2/3 pre-defined values {power, balance, green/battery}.
>
>Is it worth taking into account the power cost of cache misses vs hits?
>
>Do vcpus running on pcpus running at less than 100% speed consume fewer credits?
>
>Is there any explicit interface to cpu power state management, or would that be decoupled?

Right now cpu power management has a sysctl interface exposed, and xenpm is the tool using that interface so far.

Thanks,
Kevin
Zhiyuan Shao
2009-Apr-10 02:28 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Hi all,

Actually, I think I/O responsiveness is important for the scheduling algorithm to control. This is especially true for a virtualized desktop/client environment, since in such environments there are so many I/O events to handle, which is different from the server consolidation case, where many of the tasks are CPU-intensive.

I would like to illustrate this point with a simple scheduling algorithm, attached to this mail, which I wrote last winter (Jan. 2009). At that time I was visiting Intel OTC; thanks to the Intel guys (Disheng, Kevin Tian, etc.) for their help.

The scheduler is named SDP (you have to use the "sched=sdp" parameter on the Xen kernel line when booting); the intent is to use ideas of dynamic priority to make virtualized clients meet their usage needs. However, this scheduler is basically a simple prototype for proving the idea so far; I have not implemented the dynamic priority mechanisms yet. The solution used in this simple scheduler is largely ad hoc, and I hope it can contribute something to the future development of the next-generation Xen scheduler. BTW, I borrowed a large portion of code from the Credit scheduler.

This patch should work well on a VT-d platform (it does not do well in the emulated-device environment, since device emulation, especially of video, results in too much overhead for the scheduler to handle). We (thanks again to Intel OTC for the VT-d platform) tested the scheduler on a 3.0GHz 2-core system, invoked 2 HVM guest domains (one primary and one auxiliary, both with two VCPUs), and pinned each vcpu of each domain to a different PCPU (the VCPUs of domain0 are pinned as well), since I still have not implemented a proper VCPU migration mechanism in SDP (sorry for that; I do not think the aggressive migration mechanism of Credit is right for virtualized clients, and I am working on this now and hope to find a proper one for Xen in the near future). The sound and video cards are directly assigned to the primary domain, while the auxiliary domain uses emulated ones.

Set the "priority" value (it should really be named "I/O responsiveness"; I think I made a mistake there, since the initial objective was to use dynamic priority ideas) of domain0 to 91, and that of the primary domain to 90. Regarding the auxiliary domain, you can leave it at the default (80). Please use the attached domain0 command-line tool (i.e., sched_sdp_set [domain_id] [new_priority]) to set the new priority; I am not good at Python, sorry for that!

We tested a scenario in which the primary domain plays a DivX video while, at the same time, very big files are copied in the auxiliary domain. The video plays well! The effectiveness we experienced beats BCredit in this use case, no matter how we adjust BCredit's parameters.

Some explanations of SDP: the "priority" parameter is actually used to control I/O responsiveness. If a VCPU is woken by an I/O event at runtime, and its "priority" value happens to be higher than that of the current VCPU, the current VCPU will be preempted and the woken VCPU is scheduled. A "bonus" is given to the woken VCPU to leave it as much time as possible to complete its I/O handling. The bonus value is computed by subtracting the "priority" parameters of the two (the woken and the current) VCPUs.
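For illustration, that wake-up rule might look roughly like this in C (the structure and field names below are invented for this sketch, not taken from the attached patch):

    struct sdp_vcpu {
        int prio;    /* the per-domain "priority" / I/O-responsiveness value */
        int bonus;   /* extra time granted on wake-up, in scheduler ticks    */
    };

    /* Called when an I/O event wakes 'woken' while 'curr' is running on the
     * same pcpu.  Returns 1 if the scheduler should preempt 'curr' now. */
    static int sdp_wake_preempt(struct sdp_vcpu *woken, struct sdp_vcpu *curr)
    {
        if (woken->prio <= curr->prio)
            return 0;    /* a lower-priority wake-up never preempts */

        /* e.g. dom0 (91) waking over the auxiliary domain (80) gets 11. */
        woken->bonus = woken->prio - curr->prio;
        return 1;
    }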
This strategy actually inhibits preemption of a currently running VCPU with a high "priority" by another VCPU with a lower one, while permitting preemption the other way around, and I think this method fits the asymmetric domain-role scenario of virtualized clients and desktops well.

Regarding computation resource allocation, the simple scheduler actually shares the CPU in a round-robin fashion. An I/O event arriving at a high-"priority" VCPU gives it a little "bonus"; after using that up, the VCPU falls back to the round-robin scheduling ring. In this way it maintains some kind of fairness even in a virtualized client environment; e.g., in our testing scenario we found that the file copying in the auxiliary domain proceeds well (although a little slowly) while the primary domain plays a DivX video, which generates a high volume of I/O events.

From this experience, I think I/O responsiveness is an important parameter to add in the development of the new scheduler, since platforms have their own performance metrics, and the user can adjust the I/O responsiveness parameter of the domains to make them work well.

Moreover, I think some characteristics of the Credit scheduler do not fit virtualized clients/desktops well (for further discussion if possible). If used in the virtualized client/desktop scenario, the worst aspect of Credit is its small state space for marking VCPUs (i.e., BOOST, UNDER and OVER). This makes it very inconvenient, at a minimum, to differentiate the VCPUs of different domains and with different kinds of tasks, although the small state space does work well in consolidated servers. The second inconvenient aspect of the Credit scheduler is the method by which it "boosts" VCPUs. In the original Credit, a woken VCPU has to have enough credits (UNDER state) to be promoted to the BOOST state. However, the domain (VCPU) may have used up its credit while it still has a critical task; at that moment fairness is a secondary consideration, and should be maintained in later phases. BCredit brings some changes here, although BCredit may unfortunately give fairness little consideration.

Thanks,

2009-04-10

Zhiyuan Shao
Jeremy Fitzhardinge
2009-Apr-10 16:15 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Tian, Kevin wrote:

>> From: Jeremy Fitzhardinge
>> Sent: 10 April 2009 2:42
>>
>> George Dunlap wrote:
>>> 1. Design targets
>>>
>>> We have three general use cases in mind: Server consolidation, virtual desktop providers, and clients (e.g. XenClient).
>>>
>>> For servers, our target "sweet spot" for which we will optimize is a system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus). Ideal performance is expected to be reached at about 80% total system cpu utilization; but the system should function reasonably well up to a utilization of 800% (e.g., a load of 8).
>>
>> Is that forward-looking enough? That hardware is currently available; what's going to be commonplace in 2-3 years?
>
> Good point.
>
>>> * HT-aware.
>>>
>>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>>
>> Would it be worth just pair-scheduling HT threads so they're always running in the same domain?
>
> Running the same domain doesn't help fairness; instead, it makes it worse.

I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.

If we present them as sibling pairs to guests, then it becomes the guest OS's problem (ie, we don't try to hide the true nature of these pcpus). That's fairer for the guest, because they know what they're getting, and Xen can charge the guest for cpu use on a thread-pair, rather than trying to work out how the two threads compete. In other words, if only one thread is running, then it can charge max-thread-throughput; if both are running, it can charge max-core-throughput (possibly scaled by whatever performance mode the core is running in).

>>> * Power-aware.
>>>
>>> Using as many sockets / cores as possible can increase the total cache size available to VMs, and thus (in the absence of inter-VM sharing) increase total computing power; but keeping multiple sockets and cores powered up also increases the electrical power used by the system. We want a configurable way to balance between maximizing processing power vs minimizing electrical power.
>>
>> I don't remember if there's a proper term for this, but what about having multiple domains sharing the same scheduling context, so that a stub domain can be co-scheduled with its main domain, rather than having them treated separately?
>
> This is really desired.
>
>> Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).
>
> The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)

I'm being unclear by conflating two issues.

One is that when dom0 (or a driver domain) does some work on behalf of a guest, it seems like it would be useful for the time used to be credited against the guest rather than against dom0.

My thought is that, rather than having the scheduler parameters be the implicit result of "vcpu A belongs to domain X, charge X", each vcpu has a charging domain which can be updated via (privileged) hypercall. When dom0 is about to do some work, it updates the charging domain accordingly (with some machinery to make that a per-task property within the kernel so that task context switches update the vcpu state appropriately).

A further extension would be the idea of charging grants, where domain A could grant domain B charging rights, and B could set its vcpus to charge A as an unprivileged operation. As with grant tables, revocation poses some interesting problems.

This is a generalization of coscheduled stub domains, because you could achieve the same effect by making the stub domain simply switch all its vcpus to charge its main domain.

How to schedule vcpus? They could either be scheduled as if they were part of the other domain; or be scheduled with their "home" domain, but their time spent is charged against the other domain. The former is effectively priority inheritance, and raises all the normal issues - but it would be appropriate for co-scheduled stub domains. The latter makes more sense for dom0, but it's less clear what it actually means: does it consume any home domain credits? What happens if the other domain's credits are all consumed? Could two domains collude to get more than their fair share of cpu?

The second issue is trying to share pcpu resources between vcpus where appropriate. The obvious case is doing some kind of cross-domain copy operation, where the data could well be hot in cache, so if you use the same pcpu you can just get cache hits. Of course there's the tradeoff that you're necessarily serialising things which could be done in parallel, so perhaps it doesn't work well in practice.

J
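A minimal sketch of the charging-domain idea above, purely as an assumption of how it could be structured (no such hypercall or fields exist today):

    struct domain_acct {
        long credit;                 /* credit pool of one domain          */
    };

    struct vcpu_charge {
        struct domain_acct *home;    /* the domain this vcpu belongs to    */
        struct domain_acct *charge;  /* who currently pays for its time    */
    };

    /* Privileged operation: charge 'target' for this vcpu's time until
     * reset; passing NULL reverts to charging the home domain. */
    static void set_charge_domain(struct vcpu_charge *v, struct domain_acct *target)
    {
        v->charge = target ? target : v->home;
    }

    /* Accounting path: debit whoever is currently being charged. */
    static void account_time(struct vcpu_charge *v, long consumed)
    {
        v->charge->credit -= consumed;
    }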
Ian Pratt
2009-Apr-10 17:16 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.

The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.

Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.

Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.

Possibly having two modes of operation would be a good thing:

 1. explicitly present HT to guests and gang schedule threads

 2. normal free-for-all with HT-aware accounting.

Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

Ian
Jeremy Fitzhardinge
2009-Apr-10 17:19 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Ian Pratt wrote:

>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
> The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>
> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>
> Possibly having two modes of operation would be a good thing:
>
> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

J
Jeremy Fitzhardinge
2009-Apr-10 17:34 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Ian Pratt wrote:

> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

Well, we could extend vcpu hotplug to deal with those kinds of cpu topology changes, but I guess that doesn't help most Windows/hvm guests. But if those vcpus stop being siblings, I don't think it would hurt if we stopped gang scheduling them, so long as they're kept close (same package, I guess).

J
Tian, Kevin
2009-Apr-11 09:52 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>Sent: 11 April 2009 0:15
>
>>>> * HT-aware.
>>>>
>>>> Running on a logical processor with an idle peer thread is not the same as running on a logical processor with a busy peer thread. The scheduler needs to take this into account when deciding "fairness".
>>>
>>> Would it be worth just pair-scheduling HT threads so they're always running in the same domain?
>>
>> Running the same domain doesn't help fairness; instead, it makes it worse.
>
>I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
>If we present them as sibling pairs to guests, then it becomes the guest OS's problem (ie, we don't try to hide the true nature of these pcpus). That's fairer for the guest, because they know what they're getting, and Xen can charge the guest for cpu use on a thread-pair, rather than trying to work out how the two threads compete. In other words, if only one thread is running, then it can charge max-thread-throughput; if both are running, it can charge max-core-throughput (possibly scaled by whatever performance mode the core is running in).

This is based on the assumption that workloads within a VM are more HT-friendly than workloads across VMs. Maybe that's true in some cases, but I don't think it's a strong point in most deployments.

The major worry to me is the complexity added by exposing such sibling pairs to the guest. You then have to schedule at core level for that VM, since the implication of HT must always be maintained, or else the reverse effect could be seen when the VM does try to use that topology. This brings trouble to the scheduler. Not all VMs are guest SMP, and then the VM exposed with HT is actually treated unfairly, as one more limitation is imposed on it: a partially idle core can't be used by it, while other VMs are immune. Another tricky part is that you have to gang schedule that VM, which is fancy in concept, but no one has come up with a solid implementation in reality.

The above is why I said fairness could be worse at a general level. It could be useful in some specific scenarios: one is the client, where, however, it's better to expose the full topology instead of just HT; the other is some mission-critical usages where cpu resources are partitioned, and there exposing HT could also be useful.

>>> Also, a somewhat related point, some kind of directed schedule so that when one vcpu is synchronously waiting on another vcpu, have it directly hand over its pcpu to avoid any cross-cpu overhead (including the ability to take advantage of directly using hot cache lines). That would be useful for intra-domain IPIs, etc, but also inter-domain context switches (domain<->stub, frontend<->backend, etc).
>>
>> The hard part here is to find the hint about WHICH vcpu a given vcpu is waiting on, which is not straightforward. Of course the stub domain is the most likely example, but that may already be cleanly addressed if the co-scheduling above could be added? :-)
>
>I'm being unclear by conflating two issues.
>
>One is that when dom0 (or a driver domain) does some work on behalf of a guest, it seems like it would be useful for the time used to be credited against the guest rather than against dom0.
>
>My thought is that, rather than having the scheduler parameters be the implicit result of "vcpu A belongs to domain X, charge X", each vcpu has a charging domain which can be updated via (privileged) hypercall. When dom0 is about to do some work, it updates the charging domain accordingly (with some machinery to make that a per-task property within the kernel so that task context switches update the vcpu state appropriately).
>
>A further extension would be the idea of charging grants, where domain A could grant domain B charging rights, and B could set its vcpus to charge A as an unprivileged operation. As with grant tables, revocation poses some interesting problems.
>
>This is a generalization of coscheduled stub domains, because you could achieve the same effect by making the stub domain simply switch all its vcpus to charge its main domain.

Yup. This is one long-missing part in Xen. The current accounting mechanism, as in xentop, is incomplete. In this part KVM could have it easier under the cap of the container.
Tian, Kevin
2009-Apr-11 09:57 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 11 April 2009 1:16
>
>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>
>The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
>Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>
>Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>
>Possibly having two modes of operation would be a good thing:
>
> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT-aware accounting.
>
>Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.

What do you mean by 'free-for-all'?

Thanks,
Kevin
Tian, Kevin
2009-Apr-11 10:00 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>Sent: 11 April 2009 1:20
>
>Ian Pratt wrote:
>>> I don't know what the performance characteristics of modern HT are, but in P4-HT the throughput of a given thread was very dependent on what the other thread was doing. If it's competing with some other arbitrary domain, then it's hard to make any estimates about what the throughput of a given vcpu's thread is.
>>
>> The original Northwood P4's were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>>
>> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single-VCPU guests, and you want every logical processor available.
>>
>> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer-term fairness.
>>
>> Possibly having two modes of operation would be a good thing:
>>
>> 1. explicitly present HT to guests and gang schedule threads
>>
>> 2. normal free-for-all with HT-aware accounting.
>>
>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>
>This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

The normal name for this is Turbo Boost. However, it'd be difficult for software to account for the extra cycles gained from the overclock, as whether boost actually happens and how many cycles can be boosted are completely controlled by the hardware unit. There is some feedback mechanism, though, to obtain the average frequency over an elapsed interval. However, currently the cpufreq governor runs in a time-based style without any connection to the scheduler. That's one part we could further enhance.

Thanks,
Kevin
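The feedback mechanism mentioned here is presumably the APERF/MPERF MSR pair (MPERF ticks at the nominal frequency, APERF at the delivered frequency, so the ratio over an interval gives the average effective speed, including Turbo). A sketch, assuming a per-pcpu rdmsr() helper:

    #include <stdint.h>

    #define MSR_IA32_MPERF 0xE7
    #define MSR_IA32_APERF 0xE8

    extern uint64_t rdmsr(uint32_t msr);   /* assumed MSR-read helper */

    /* Average delivered/nominal speed, in permille, since the last snapshot.
     * A value like this could scale credit burn, as discussed above. */
    static uint32_t effective_speed_permille(uint64_t *prev_aperf,
                                             uint64_t *prev_mperf)
    {
        uint64_t a = rdmsr(MSR_IA32_APERF), m = rdmsr(MSR_IA32_MPERF);
        uint64_t da = a - *prev_aperf, dm = m - *prev_mperf;

        *prev_aperf = a;
        *prev_mperf = m;
        return dm ? (uint32_t)(da * 1000 / dm) : 1000;
    }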
Ian Pratt
2009-Apr-11 17:11 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>> Possibly having two modes of operation would be a good thing:
>>
>> 1. explicitly present HT to guests and gang schedule threads
>>
>> 2. normal free-for-all with HT-aware accounting.
>>
>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>
> What do you mean by 'free-for-all'?

Same as today, i.e. we don't gang schedule and all threads are available for running VCPUs.

I think it's reasonable to have two different modes of operation. For some CPU-intensive server virtualization-type workloads the admin basically wants to partition the machine. In this situation it's reasonable to expose the physical topology to guests (not just hyperthreads, but potentially cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too).

For more general virtualization workloads where the total number of VCPUs is rather greater than the number of physical CPUs, the current behaviour is preferable. HT-aware accounting will mean that VCPUs that run concurrently on the same core will be charged less than the full period they are scheduled for.

Thanks,
Ian
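As a sketch of what that HT-aware accounting could look like (the 70% factor is just a placeholder assumption for whatever per-thread throughput measurement suggests):

    #include <stdint.h>

    struct ht_vcpu {
        int64_t credit;
    };

    /* Charge full price for time the vcpu had the core to itself, and only a
     * fraction for time its sibling thread was also busy. */
    static void burn_ht_aware(struct ht_vcpu *v, int64_t ns_sibling_idle,
                              int64_t ns_sibling_busy)
    {
        v->credit -= ns_sibling_idle;
        v->credit -= ns_sibling_busy * 70 / 100;
    }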
Tian, Kevin
2009-Apr-12 06:27 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 12 April 2009 1:12
>
>>> Possibly having two modes of operation would be a good thing:
>>>
>>> 1. explicitly present HT to guests and gang schedule threads
>>>
>>> 2. normal free-for-all with HT-aware accounting.
>>>
>>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>>
>> What do you mean by 'free-for-all'?
>
>Same as today, i.e. we don't gang schedule and all threads are available for running VCPUs.
>
>I think it's reasonable to have two different modes of operation. For some CPU-intensive server virtualization-type workloads the admin basically wants to partition the machine. In this situation it's reasonable to expose the physical topology to guests (not just hyperthreads, but potentially cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too).
>
>For more general virtualization workloads where the total number of VCPUs is rather greater than the number of physical CPUs, the current behaviour is preferable. HT-aware accounting will mean that VCPUs that run concurrently on the same core will be charged less than the full period they are scheduled for.

Agree.

Thanks,
Kevin
George Dunlap
2009-Apr-15 13:54 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> This can probably be extended to Intel's hyper-dynamic flux mode (that may not be the real marketing name), where it can overclock one core if the other is idle.

Jeremy, did you mean we could expose an entire socket to a guest VM, so that it could schedule so as to take advantage of the effects of Turbo Boost, just as we can expose thread pairs to a VM and let the guest OS scheduler deal with threading issues?

-George
George Dunlap
2009-Apr-15 14:29 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 9, 2009 at 7:41 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:> I don''t remember if there''s a proper term for this, but what about having > multiple domains sharing the same scheduling context, so that a stub domain > can be co-scheduled with its main domain, rather than having them treated > separately?I think it''s been informally called "co-scheduling". :-) Yes, I''m going to be looking into that. One of the things that makes it a little less easy is that (as I understand it) there is only one stub domain "vcpu" per VM, which is shared by all a VM''s vcpus.> Also, a somewhat related point, some kind of directed schedule so that when > one vcpu is synchronously waiting on anohter vcpu, have it directly hand > over its pcpu to avoid any cross-cpu overhead (including the ability to take > advantage of directly using hot cache lines). That would be useful for > intra-domain IPIs, etc, but also inter-domain context switches > (domain<->stub, frontend<->backend, etc).The only problem is if the "service" domain has other work that it may do after it''s done. In my tests on a 2-core box doing scp to an HVM guest, it''s faster if I pin dom0 and domU to separate cores than if I pin them to the same core. Looking at the traces, it seems as though after dom0 has woken up domU, it spends another 50K cycles or so before blocking. Stub domains may behave differently; in any case, it''s something that needs experimentation to decide.>> For example, one could give dom0 a "reservation" of 50%, but leave the >> weight at 256. No matter how many other VMs run with a weight of 256, >> dom0 will be guaranteed to get 50% of one cpu if it wants it. >> > > How does the reservation interact with the credits? Is the reservtion in > addition to its credits, or does using the reservation consume them?I think your question is, how does the reservation interact with weight? (Credits is the mechanism to implement both.) The idea is that a VM would get either an amount of cpu proportional to its weight, or the reservation, whichever is greater. So suppose that VMs A, B, and C have weights of 256 on a system with 1 core, no reservations. If A and B are burning as much cpu as they can and C is idle, then A and B should get 50% each. If all of them (A,B,C) are burning as much cpu as they can, they will should 33% each. Now suppose that we give B a reservation of 40%. If A and B are burning as much as they can and C is idle, then A and B should again get 50% each. However, if all of them are burning as much as they can, then B should get 40% (its reservation), and A and C should each get 30% (i.e., the remaining 60% divided by weight). Does that make sense?> Is it worth taking into account the power cost of cache misses vs hits?If we have a general framework for "goodness" and "badness", and we have a way of measuring cache hits / misses, we should be able to extend the scheduler to do so.> Do vcpus running on pcpus running at less than 100% speed consume fewer > credits?Yes, we''ll also need to account for cpu frequency states in our accounting.> Is there any explicit interface to cpu power state management, or would that > be decoupled?I think we might be able to fold this in; it depends on how complicated things get. 
Just as one can imagine a "badness factor" for powering up a second CPU, which we can weigh against the "badness" of vcpus waiting on the runqueue, we can imagine a "badness factor" for running at a higher cpu frequency, which can be weighed against either powering up extra cores / cpus or having vcpus wait on the runqueue. Let's start with a basic "badness factor" and see if we can get it worked out properly, and then look at extending it to these sorts of things.
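To make the weight-vs-reservation arithmetic above concrete, here is a minimal C sketch of the "weighted share or reservation, whichever is greater" rule; the structures, names and iterative rebalance are hypothetical illustrations, not credit2 code:

/*
 * Toy model of the weight-vs-reservation split described above.
 * Not Xen code; a hypothetical sketch to illustrate the arithmetic.
 */
#include <stdio.h>

#define NR_VMS 3

struct vm {
    const char *name;
    double weight;      /* relative weight, e.g. 256                     */
    double reservation; /* guaranteed share, percent of one cpu; 0 = none */
    double share;       /* computed share, percent                        */
    int pinned;         /* share fixed at the reservation                 */
};

static void compute_shares(struct vm *vms, int n, double capacity)
{
    int changed;

    for (int i = 0; i < n; i++)
        vms[i].pinned = 0;

    do {
        double free_cap = capacity, wsum = 0.0;

        changed = 0;
        for (int i = 0; i < n; i++) {
            if (vms[i].pinned)
                free_cap -= vms[i].reservation;
            else
                wsum += vms[i].weight;
        }
        for (int i = 0; i < n; i++) {
            if (vms[i].pinned)
                continue;
            vms[i].share = free_cap * vms[i].weight / wsum;
            /* The reservation wins if the weighted share falls below it. */
            if (vms[i].share < vms[i].reservation) {
                vms[i].share = vms[i].reservation;
                vms[i].pinned = 1;
                changed = 1;
            }
        }
    } while (changed);
}

int main(void)
{
    /* The example from the mail: 1 core, equal weights, B reserves 40%. */
    struct vm vms[NR_VMS] = {
        { "A", 256, 0.0 },
        { "B", 256, 40.0 },
        { "C", 256, 0.0 },
    };

    compute_shares(vms, NR_VMS, 100.0);
    for (int i = 0; i < NR_VMS; i++)
        printf("%s: %.0f%%\n", vms[i].name, vms[i].share);
    return 0;
}

With the example inputs (equal weights, B reserving 40% of one core, all three cpu-bound), the rebalance converges to 30% / 40% / 30%, matching the numbers in the message above.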
George Dunlap
2009-Apr-15 15:07 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/10 Tian, Kevin <kevin.tian@intel.com>:
> How about the total number of VMs you'd like to support?

A rule-of-thumb number would be that we want to perform well at 4 VMs per core, and wouldn't mind having a performance "cliff" past 8 per core (not thread). So for a 16-core system, that would be "good" for 64 VMs and "acceptable" up to 128 VMs.

> Do you mean that the same elapsed time in the above two scenarios will be
> translated into different credits?

Yes. Ideally, we want to give "processing power" based on weight. But the "processing power" of a thread whose sibling is idle is significantly more than the "processing power" of a thread whose sibling is running. (Same thing possibly for cpu frequency scaling.) So we'd want to arrange the credits such that VMs with equal weight get equal "processing power", not just equal "time on a logical cpu".

> Xen 3.4 now supports "sched_smt_power_savings" (both a boot option
> and touchable by xenpm) to change the power/performance preference.
> It's a simple implementation that just reverses the span order from
> the existing package->core->thread to thread->core->package. More
> fine-grained flexibility could be given in future if a hierarchical scheduling
> concept could be more clearly constructed, like the domain scheduler
> in Linux.

I haven't looked at this code. From your description here it sounds like a sort of simple hack to get the effect we want (either spreading things out or pushing them together) -- is that correct?

My general feeling is that hacks are good short-term solutions, but not long-term. Things always get more complicated, and often have unexpected side-effects. I think since we're doing scheduler work, it's worth trying to see if we can actually solve the power/performance problem.

> imo, weight is not strictly translated into the care for latency. Any
> elaboration on that? I remember that previously Nishiguchi-san
> gave an idea to boost credit, and Disheng proposed static priority.
> Maybe you can make a summary to help people see how latency would
> be ensured in your proposal.

All of this needs to be run through experiments. So far, I've had really good success with putting waking VMs in "boost" priority for 1ms if they still have credits. (And unlike the credit scheduler, I try to make sure that a VM rarely runs out of credits.)

> There should be some way to adjust or limit usage of 'reservation' when
> multiple vcpus claim a desire which sums up to something exceeding the
> cpu's computing power, or which weakens your general
> 'weight-as-basic-unit' idea.

All "reservations" on the system must add up to less than the total processing power of the system. So a system with 2 cores can't have a sum of reservations of more than 200%. Xen will check this when setting the reservation and return an appropriate error message if necessary.

>> * We will also have an interface to the cpu-vs-electrical power.
>>
>> This is yet to be defined. At the hypervisor level, it will probably
>> be a number representing the "badness" of powering up extra cpus /
>> cores. At the tools level, there will probably be the option of
>> either specifying the number, or of using one of 2/3 pre-defined
>> values {power, balance, green/battery}.
>
> Not sure how that number will be defined. Maybe we can follow
> the current way and just add individual name-based options matching
> its purpose (such as migration_cost and sched_smt_power_savings...)

At the scheduler level, I was thinking along the lines of "core_power_up_cost".
This would be comparable to the cost of having things waiting on the runqueue. So (for example) if the cost were 0.1, then when the load on the current processors reached 1.1, the scheduler would power up another core. You could set it to 0.5 or 1.0 to save more power (at the cost of some performance). I think defining it that way is the closest to what you really want: a way to define the performance impact vs power consumption.

Obviously at the user interface level, we might have something more manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or something like that.

But as I said, the *goal* is to have a useful configurable interface; the implementation will depend on what actually can be made to work in practice.

 -George
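A hypothetical sketch of the "core_power_up_cost" knob and the {power, balance, green} presets described above; the names, the preset values and the load metric are illustrative assumptions, not Xen code:

#include <stdbool.h>
#include <stdio.h>

/* Tools-level presets mapped onto the scheduler-level number. */
enum power_pref { PREF_POWER, PREF_BALANCE, PREF_GREEN };

static double preset_to_cost(enum power_pref p)
{
    switch (p) {
    case PREF_POWER:   return 0.0;  /* pure performance          */
    case PREF_BALANCE: return 0.2;
    case PREF_GREEN:   return 0.8;  /* trade latency for power   */
    }
    return 0.0;
}

/*
 * 'load' is the average runnable-vcpu load per powered core
 * (1.0 == exactly one vcpu's worth of work per core, no queueing).
 */
static bool should_power_up_core(double load, int powered, int total,
                                 double core_power_up_cost)
{
    /*
     * Waking another core is worth it only once the queueing "badness"
     * on the powered cores exceeds the configured power-up "badness".
     */
    return powered < total && load > 1.0 + core_power_up_cost;
}

int main(void)
{
    double cost = preset_to_cost(PREF_BALANCE);

    printf("%d\n", should_power_up_core(1.1, 2, 8, cost));  /* 0: stay parked */
    printf("%d\n", should_power_up_core(1.3, 2, 8, cost));  /* 1: power up    */
    return 0;
}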
George Dunlap
2009-Apr-15 15:47 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> The normal name for this is Turbo Boost. However it'd be difficult
> for software to account for extra cycles gained from overclocking,
> as whether boost actually happens and how many cycles can
> be boosted are completely controlled by the hardware.

From the context, it sounded like Jeremy was saying that if we expose a whole socket to a guest, then the guest can try to schedule things either to take advantage of multiple cores or to take advantage of Turbo Boost. (I.e., punt the Turbo Boost performance optimization to the guest, just as we could punt the hyperthreading problem to the guest.)

In any case, even if we can't control it, we may be able to do some estimates (i.e., we expect this core to run at about 120%). There will probably be some performance counters that we could use to estimate how much "boost" a VM actually got and deal with credits accordingly... but that's yet another level of complication. I'll put it in the list of things to look at, and we'll see how far we get.

 -George
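Purely as an illustration of what such an estimate might feed into (the 120% figure, the per-mille interface and the function name are assumptions, not an existing Xen mechanism), credit burn could be scaled by the estimated effective clock:

#include <stdint.h>
#include <stdio.h>

/*
 * Charge for 'ns' nanoseconds of run time, scaled by the estimated
 * speed of the pcpu while the vcpu ran: 1000 == nominal frequency,
 * 1200 == running ~20% above nominal thanks to Turbo Boost,
 * 800  == throttled to 80% by frequency scaling.
 */
static int64_t credits_to_burn(int64_t ns, unsigned int est_speed_permille)
{
    return ns * est_speed_permille / 1000;
}

int main(void)
{
    /* 1ms of run time costs more on a boosted core, less on a throttled one. */
    printf("%lld\n", (long long)credits_to_burn(1000000, 1200));
    printf("%lld\n", (long long)credits_to_burn(1000000, 800));
    return 0;
}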
George Dunlap
2009-Apr-15 15:56 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> The major worry to me is the added complexity of exposing such sibling
> pairs to the guest. You then have to schedule at core level for that VM,
> since the implication of HT should always be maintained, or else the
> reverse effect could be seen when the VM does try to utilize that topology.
> This brings trouble to the scheduler. Not all VMs are guest SMP, and
> then the VM being exposed with HT is actually treated unfairly, as one
> more limitation is imposed: a partially idle core can't be used by it
> while other VMs are immune. Another tricky part is that you have to
> gang schedule that VM, which is in concept fancy but no one has
> come up with a solid implementation in practice.

I think gang scheduling with this limited scope (a hyper-pair to be scheduled on a hyper-pair) should be a lot easier than the general case. In any case, as long as we assign and deduct credits appropriately, a threaded VM shouldn't have a disadvantage compared to a single-threaded VM.

 -George
Jeremy Fitzhardinge
2009-Apr-15 16:23 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
George Dunlap wrote:
> On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> This can probably be extended to Intel's hyper-dynamic flux mode (that may
>> not be the real marketing name), where it can overclock one core if the
>> other is idle.
>
> Jeremy,
>
> Did you mean we could expose an entire socket to a guest VM, so that
> it could schedule so as to take advantage of the effects of Turbo
> Boost, just as we can expose thread pairs to a VM and let the guest OS
> scheduler deal with threading issues?

Yes, precisely. They're the same in that Xen concurrently schedules two (or more?) vcpus to the guest which have interdependent performance.

One could imagine a case where a guest with a single-threaded workload gets best performance by being given a thread/core pair, running its work on one while explicitly keeping the other idle. Of course that idle core is lost to the rest of the system in the meantime, so the guest should get charged for both.

And some kind of small-scale gang scheduling might be useful for small SMP guests anyway, because their spinlocks and IPIs will work as expected, and they'll presumably get shared cache at some level.

    J
Tian, Kevin
2009-Apr-16 04:58 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 15 April 2009 23:07
>
>> Do you mean that the same elapsed time in the above two scenarios will be
>> translated into different credits?
>
> Yes. Ideally, we want to give "processing power" based on weight.
> But the "processing power" of a thread whose sibling is idle is
> significantly more than the "processing power" of a thread whose
> sibling is running. (Same thing possibly for cpu frequency scaling.)
> So we'd want to arrange the credits such that VMs with equal weight
> get equal "processing power", not just equal "time on a logical cpu".

Yup, this is one interesting part to be explored further.

>> Xen 3.4 now supports "sched_smt_power_savings" (both a boot option
>> and touchable by xenpm) to change the power/performance preference.
>> It's a simple implementation that just reverses the span order from
>> the existing package->core->thread to thread->core->package. More
>> fine-grained flexibility could be given in future if a hierarchical
>> scheduling concept could be more clearly constructed, like the domain
>> scheduler in Linux.
>
> I haven't looked at this code. From your description here it sounds
> like a sort of simple hack to get the effect we want (either
> spreading things out or pushing them together) -- is that correct?

Yes: spread first vs. fill first.

> My general feeling is that hacks are good short-term solutions, but
> not long-term. Things always get more complicated, and often have
> unexpected side-effects. I think since we're doing scheduler work,
> it's worth trying to see if we can actually solve the
> power/performance problem.

Agreed. Have you looked at the Linux domain scheduler idea? I'm not sure whether that topology-based multi-level scheduler would help here or over-complicate things.

>> imo, weight is not strictly translated into the care for latency. Any
>> elaboration on that? I remember that previously Nishiguchi-san
>> gave an idea to boost credit, and Disheng proposed static priority.
>> Maybe you can make a summary to help people see how latency would
>> be ensured in your proposal.
>
> All of this needs to be run through experiments. So far, I've had
> really good success with putting waking VMs in "boost" priority for
> 1ms if they still have credits. (And unlike the credit scheduler, I
> try to make sure that a VM rarely runs out of credits.)

By the way, accurate accounting (at context switch, instead of the current tick-based accounting) should also be incorporated, if you do want to manipulate credits at a fine granularity.

>> There should be some way to adjust or limit usage of 'reservation' when
>> multiple vcpus claim a desire which sums up to something exceeding the
>> cpu's computing power, or which weakens your general
>> 'weight-as-basic-unit' idea.
>
> All "reservations" on the system must add up to less than the total
> processing power of the system. So a system with 2 cores can't have a
> sum of reservations of more than 200%. Xen will check this when setting
> the reservation and return an appropriate error message if necessary.

Return an error, or scale previous successful reservations down?

>>> * We will also have an interface to the cpu-vs-electrical power.
>>>
>>> This is yet to be defined. At the hypervisor level, it will probably
>>> be a number representing the "badness" of powering up extra cpus /
>>> cores. At the tools level, there will probably be the option of
>>> either specifying the number, or of using one of 2/3 pre-defined
>>> values {power, balance, green/battery}.
>>
>> Not sure how that number will be defined.
>> Maybe we can follow
>> the current way and just add individual name-based options matching
>> its purpose (such as migration_cost and sched_smt_power_savings...)
>
> At the scheduler level, I was thinking along the lines of
> "core_power_up_cost". This would be comparable to the cost of having
> things waiting on the runqueue. So (for example) if the cost were 0.1,

Who decides what the cost should be? How is it easily useful to an end customer?

> then when the load on the current processors reached 1.1, the
> scheduler would power up another core. You could set it to 0.5 or 1.0 to save

What do you mean by 'power up'? Boost its frequency, or migrate load to that core?

> more power (at the cost of some performance). I think defining it
> that way is the closest to what you really want: a way to define the
> performance impact vs power consumption.

I'm still a bit confused here. What (in which situation) is translated into a value comparable to "core_power_up_cost"?

> Obviously at the user interface level, we might have something more
> manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or
> something like that.

Then how is this triple mapped to the "core_power_up_cost" above?

> But as I said, the *goal* is to have a useful configurable interface;
> the implementation will depend on what actually can be made to work in
> practice.

I agree with this goal, but I'm not convinced by the above example. :-)

Thanks
Kevin
Tian, Kevin
2009-Apr-16 05:11 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 15 April 2009 23:56
>
> 2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
>> The major worry to me is the added complexity of exposing such sibling
>> pairs to the guest. You then have to schedule at core level for that VM,
>> since the implication of HT should always be maintained, or else the
>> reverse effect could be seen when the VM does try to utilize that topology.
>> This brings trouble to the scheduler. Not all VMs are guest SMP, and
>> then the VM being exposed with HT is actually treated unfairly, as one
>> more limitation is imposed: a partially idle core can't be used by it
>> while other VMs are immune. Another tricky part is that you have to
>> gang schedule that VM, which is in concept fancy but no one has
>> come up with a solid implementation in practice.
>
> I think gang scheduling with this limited scope (a hyper-pair to be
> scheduled on a hyper-pair) should be a lot easier than the general
> case. In any case, as long as we assign and deduct credits
> appropriately, a threaded VM shouldn't have a disadvantage compared to
> a single-threaded VM.

Could you elaborate more on what is being simplified compared to generic gang scheduling? I used to be scared by the complexity of having multiple vcpus sync in and sync out, especially alongside other heterogeneous VMs (without a gang scheduling requirement). It's possibly simpler if all VMs in the system are hyper-pair based... :-)

Thanks
Kevin
George Dunlap
2009-Apr-16 10:27 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
> Could you elaborate more on what is being simplified compared to
> generic gang scheduling? I used to be scared by the complexity of having
> multiple vcpus sync in and sync out, especially alongside other
> heterogeneous VMs (without a gang scheduling requirement). It's possibly
> simpler if all VMs in the system are hyper-pair based... :-)

(I've only done some thought experiments re gang scheduling, so take that into account as you read this description.)

The general gang scheduling problem is that you have P processors and N VMs, each of which has up to M vcpus. Each VM may have a different number of vcpus N.vcpu < P, and at any one time there may be a different number of runnable vcpus N.runnable < N.vcpu < P; this may change at any time. The general problem is to maximize the number of used processors P (and thus maximize throughput). If you have a VM with 4 vcpus, but it's only using 2, you have a choice: do I run it on 2 cores, and let another VM use the other 2 cores? Or do I "reserve" 4 cores for it, so that if the other 2 vcpus wake up they can run immediately? If you do the former, then if one of the other vcpus wakes up you have to quickly preempt someone else; if not, you risk leaving the two cores idle for the entire timeslice, effectively throwing away the processing power. The whole problem is likely to be NP-complete, and really obnoxious to have good heuristics for.

In the case of HT gang-scheduling, the problem is significantly constrained:
* The pairs of processors to look at are constrained: each logical cpu has a pre-defined sibling.
* The quantity we're trying to gang-schedule is significantly constrained: only 2; and each gang-scheduled vcpu has its pre-defined HT sibling as well.
* If only one sibling of an HT pair is active, the other one isn't wasted; the active thread will get more processing power. So we don't have that risky choice.

So there are really only a handful of new cases to consider:
* An HT vcpu pair wakes up or comes to the front of the queue when another HT vcpu pair is running.
  + Simple: order normally. If this pair is a higher priority (whatever that means) than the running pair, preempt the running pair (i.e., preempt the vcpus on both logical cpus).
* A non-HT vcpu becomes runnable (or comes to the front of the runqueue) when an HT vcpu pair is on a pair of threads.
  + If the non-HT vcpu priority is lower, wait until the HT vcpu pair is finished.
  + If the non-HT vcpu priority is higher, we need to decide whether to wait longer or whether to preempt both threads. This may depend on whether there are other non-HT vcpus waiting to run, and what their priority is.
* An HT vcpu pair becomes runnable when non-HT vcpus are running on the threads.
  + Deciding whether to wait or preempt both threads will depend on the relative weights of both.

These decisions are a lot easier to deal with than the full gang-scheduling problem (AFAICT). One can imagine, for instance, that each HT pair could share one runqueue. A vcpu HT pair would be put on the runqueue as an individual entity. When it reached the top of the queue, both threads would preempt what was on them and run the pair of threads (or idle if the vcpu was idle). Otherwise, each thread would take a single non-pair vcpu and execute it.

At any rate, that's a step further down the road. First we need to address basic credits, latency, and load-balancing. (Implementation e-mails to come in a bit.)
 -George
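A rough sketch of the shared per-HT-pair runqueue idea described in the message above. The types and helper names are invented for illustration; this is not the credit2 implementation, and it glosses over how the preempted sibling picks up its half of the pair:

#include <stdbool.h>
#include <stddef.h>

struct vcpu;                   /* opaque here; stands in for Xen's struct vcpu */

struct sched_item {
    bool is_pair;              /* true for a gang-scheduled hyperthread pair   */
    struct vcpu *v[2];         /* v[1] is NULL for a single (non-pair) vcpu    */
    struct sched_item *next;   /* priority-ordered runqueue link               */
};

struct ht_pair_runqueue {
    struct sched_item *head;   /* one queue shared by both sibling threads     */
};

/*
 * Called when one thread of the sibling pair needs to pick new work.
 * 'thread' is 0 or 1, identifying which sibling is asking.
 */
static void schedule_thread(struct ht_pair_runqueue *rq, int thread,
                            struct vcpu **run_here, bool *kick_sibling)
{
    struct sched_item *it = rq->head;

    *kick_sibling = false;

    if (it == NULL) {
        *run_here = NULL;      /* nothing runnable: idle this thread */
        return;
    }

    rq->head = it->next;

    if (it->is_pair) {
        /*
         * A pair at the head of the queue claims both siblings: run one
         * half here and preempt the peer thread so it runs the other.
         * (A real implementation would hand the sibling its half
         * directly rather than have it re-walk the queue.)
         */
        *run_here = it->v[thread];
        *kick_sibling = true;
    } else {
        /* Otherwise each sibling just takes the next single vcpu. */
        *run_here = it->v[0];
    }
}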
Dan Magenheimer
2009-Apr-16 14:10 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
From a resource utilization perspective, hyper-pairing may make sense. But what about the user perspective? How would an administrator specify hyper-pairing? And more importantly, why? When consolidating workloads from, say, a group of dual-core or dual-processor servers onto some future larger hyperthreaded server, why would anyone say "please assign this to a hyper-pair", which is essentially saying "give me less peak performance than I had before"?

Also, in George's analysis, the problem is greatly simplified because today's (x86) processors are limited to two hyperthreads. How soon will we see more threads per core, given that other non-x86 CPUs already support four or more?
Jeremy Fitzhardinge
2009-Apr-16 16:32 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense. But what about the user perspective? How would
> an administrator specify hyper-pairing? And more importantly,
> why? When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?

I don't see how it makes a difference. At the moment, you're never sure if a pair of vcpus are HT thread pairs, two cores on the same socket, or on completely different sockets - all of which will have quite different performance characteristics. And unless your server is under-committed, you're always running the risk that one domain is competing with another for CPU when it needs it most - and if you're under-committed, you can always pin everything in exactly the config you want.

Besides, the chances are good that the single-threaded performance of each core on your shiny new server will be fast enough to overcome the cost of HT compared to your old server...

> Also, in George's analysis, the problem is greatly
> simplified because today's (x86) processors are limited
> to two hyperthreads. How soon will we see more threads
> per core, given that other non-x86 CPUs already support
> four or more?

I think the simplifying factor is that the number of threads/cores you're ganging together is a relatively small proportion of the total number of available threads/cores, so the problem is under-constrained and there are lots of nearly-optimal solutions. If you're trying to gang schedule a large proportion of your total resources, then you get into tricky box-packing territory.

    J
Andrew Lyon
2009-Apr-16 18:20 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 16, 2009 at 5:32 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Dan Magenheimer wrote:
>> From a resource utilization perspective, hyper-pairing may
>> make sense. But what about the user perspective? How would
>> an administrator specify hyper-pairing? And more importantly,
>> why? When consolidating workloads from, say, a group
>> of dual-core or dual-processor servers onto some future
>> larger hyperthreaded server, why would anyone say
>> "please assign this to a hyper-pair", which is essentially
>> saying "give me less peak performance than I had before"?
>
> I don't see how it makes a difference. At the moment, you're never sure if
> a pair of vcpus are HT thread pairs, two cores on the same socket, or on
> completely different sockets - all of which will have quite different
> performance characteristics. And unless your server is under-committed,
> you're always running the risk that one domain is competing with another for
> CPU when it needs it most - and if you're under-committed, you can always
> pin everything in exactly the config you want.
>
> Besides, the chances are good that the single-threaded performance of each
> core on your shiny new server will be fast enough to overcome the cost of HT
> compared to your old server...

Is HT particularly worthwhile for virtualization loads? We have several older servers which have HT, and I found that when running Windows Terminal Services it actually slowed the machine down; under certain circumstances it seemed to cause the system to become extremely slow and it had to be rebooted. We disabled HT and the problem went away.

Many documents on the web recommend disabling HT for specific workloads, and most benchmarks show that when it is beneficial the performance gain is quite small.

Or is your plan to make use of HT in a way that gets the most benefit with no impact in edge cases?

Andy
Jeremy Fitzhardinge
2009-Apr-16 18:28 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Andrew Lyon wrote:
> Is HT particularly worthwhile for virtualization loads? We have
> several older servers which have HT, and I found that when running
> Windows Terminal Services it actually slowed the machine down; under
> certain circumstances it seemed to cause the system to become
> extremely slow and it had to be rebooted. We disabled HT and the
> problem went away.
>
> Many documents on the web recommend disabling HT for specific
> workloads, and most benchmarks show that when it is beneficial the
> performance gain is quite small.
>
> Or is your plan to make use of HT in a way that gets the most benefit
> with no impact in edge cases?

"New" HT, as reintroduced in i7, is supposed to be "much better".

    J
Tian, Kevin
2009-Apr-17 10:02 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> From: George Dunlap
> Sent: 16 April 2009 18:28
>
> 2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
>> Could you elaborate more on what is being simplified compared to
>> generic gang scheduling? I used to be scared by the complexity of having
>> multiple vcpus sync in and sync out, especially alongside other
>> heterogeneous VMs (without a gang scheduling requirement). It's possibly
>> simpler if all VMs in the system are hyper-pair based... :-)
>
> (I've only done some thought experiments re gang scheduling, so take
> that into account as you read this description.)

Thanks for writing this down.

> The general gang scheduling problem is that you have P processors and
> N VMs, each of which has up to M vcpus. Each VM may have a
> different number of vcpus N.vcpu < P, and at any one time there
> may be a different number of runnable vcpus N.runnable < N.vcpu < P;
> this may change at any time. The general problem is to maximize
> the number of used processors P (and thus maximize throughput).
> If you have a VM with 4 vcpus, but it's only using 2, you have a
> choice: do I run it on 2 cores, and let another VM use the other 2
> cores? Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
> wake up they can run immediately? If you do the former, then if one
> of the other vcpus wakes up you have to quickly preempt someone else;
> if not, you risk leaving the two cores idle for the entire timeslice,
> effectively throwing away the processing power. The whole problem is
> likely to be NP-complete, and really obnoxious to have good heuristics
> for.

This is related to the definition of gang scheduling. In my memory, gang scheduling comes from parallel computing requirements, where massive inter-thread communication/synchronization exists and one thread blocking also impacts the rest. That normally requires a 'gang' concept: once one thread is to be scheduled, all other threads in the same group are scheduled together; on the other hand, context switches are minimized to keep threads always in the ready state. This idea is very application specific and thus doesn't normally find its way into the general market.

Based on your whole write-up, you don't consider how to ensure that the vcpus of one VM are scheduled in sync at a given time point. Instead, your model is to ensure exclusive usage of a thread pair within a given quantum window. It's not really 'gang scheduling', but it's OK for us to call it 'simplified gang scheduling' in this context, since the key point in this discussion is whether we can do some optimization for HT, not gang scheduling itself. :-)

> In the case of HT gang-scheduling, the problem is
> significantly constrained:
> * The pairs of processors to look at are constrained: each logical cpu
> has a pre-defined sibling.
> * The quantity we're trying to gang-schedule is significantly
> constrained: only 2; and each gang-scheduled vcpu has its pre-defined
> HT sibling as well.

Then I assume only VMs with <= 2 vcpus are considered here.

> * If only one sibling of an HT pair is active, the other one isn't
> wasted; the active thread will get more processing power. So we don't
> have that risky choice.

Maybe, maybe not. Bear in mind that HT was introduced because one thread normally includes lots of stall cycles, and those stalled cycles can be used to drive another thread. Whether more processing power is gained is very workload specific.

> So there are really only a handful of new cases to consider:
> * An HT vcpu pair wakes up or comes to the front of the queue when
> another HT vcpu pair is running.
>   + Simple: order normally. If this pair is a higher priority
> (whatever that means) than the running pair, preempt the running pair
> (i.e., preempt the vcpus on both logical cpus).

"vcpu pair"? Your descriptions below are all based on vcpu pairs. How do you define the status (blocked, runnable, running) of a pair? If it's the AND of the individual statuses of the two vcpus, it's too restrictive to reach a runnable status. If it's an OR operation, you may then need to consider how to define a sane priority (is one-vcpu-ready-the-other-blocked with higher credits considered higher priority than two-vcpus-ready with lower credits? etc.)

> * A non-HT vcpu becomes runnable (or comes to the front of the
> runqueue) when an HT vcpu pair is on a pair of threads.
>   + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
> is finished.
>   + If the non-HT vcpu priority is higher, we need to decide whether to
> wait longer or whether to preempt both threads. This may depend on
> whether there are other non-HT vcpus waiting to run, and what their
> priority is.
> * An HT vcpu pair becomes runnable when non-HT vcpus are
> running on the threads.
>   + Deciding whether to wait or preempt both threads will depend on the
> relative weights of both.

All of the above looks like it adds a lot of complexity to a common single-vcpu-based scheduler.

> These decisions are a lot easier to deal with than the full
> gang-scheduling problem (AFAICT). One can imagine, for instance, that
> each HT pair could share one runqueue. A vcpu HT pair would be put on
> the runqueue as an individual entity. When it reached the top of the
> queue, both threads would preempt what was on them and run the pair of
> threads (or idle if the vcpu was idle). Otherwise, each thread would
> take a single non-pair vcpu and execute it.

That may bring undesired effects. The more cross-cpu talk you add to the scheduler, the less efficient the scheduler may be, due to locks on both runqueues.

> At any rate, that's a step further down the road. First we need to
> address basic credits, latency, and load-balancing. (Implementation
> e-mails to come in a bit.)

Yup. The first implementation should take generality, simplicity and overall efficiency as the major goals.

Lastly, any effort toward better HT efficiency is welcome to me. However, we need to be careful to avoid introducing unnecessary complexity into the scheduler. I've learned that the more complex (or flexible) the scheduler is, the more unexpected side-effects may occur in other areas. So far I'm not convinced that the above idea leads to a fairer system, either for HT vcpu pairs or for non-HT vcpus. I agree with Ian's idea that topology is exposed in the partitioning case, where the scheduler doesn't require changes, and that for the general case we just need HT accounting to ensure fairness, which is a simple enhancement. :-)

Thanks,
Kevin
George Dunlap
2009-Apr-17 10:17 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
On Thu, Apr 16, 2009 at 3:10 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense. But what about the user perspective? How would
> an administrator specify hyper-pairing? And more importantly,
> why? When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?

I think what you're saying is that when we only expose vcpus to the guest, we can either run 2 vcpus on HT pairs, or give them an entire core to themselves; but if we expose them as HT pairs and gang schedule, then we're promising only to run them on HT pairs, limiting the peak performance.

Hmm, I'm not sure that's actually true. We could, if we had a particularly idle system, split HT pairs and let them run as independent vcpus. I'm pretty sure the resulting throughput would usually be higher.

In any case, it's a bit like asking, "Why would I buy a machine with two hyperthreads instead of two cores?" Yes, going from 2 vcpus to 2 vhts (virtual hyperthreads) is a step down in computing power; so would going from a dual-core processor w/o HT to a single-core processor with HT. If you want to monotonically increase power, give it 4 vhts.

At any rate, I think we can bring these up again when we actually start to implement this feature. First things first. :-)

 -George
Dan Magenheimer
2009-Apr-17 14:13 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> In any case, it's a bit like asking, "Why would I buy
> a machine with two hyperthreads instead of two cores?"

Yes. In a physical machine, the OS takes advantage of all resources available. So it doesn't matter if some of the "processors" are cores and some are hyperthreads. You are using ALL of the CPU resources you paid for.

But in a virtualized environment, each VM gets a fraction of the resources, and if grabbing some fixed number of "processors" sometimes gets hyperthreads and sometimes gets cores, this will cause interesting issues for some workloads.

Think about a cloud where one pays for resources used. You likely would demand to pay less for a hyperpair than for a non-vht pair.

As a result, I think it will be a requirement that a system administrator be able to specify "I want two FULL cores" vs "I am willing to accept two hyperthreads". And once you get beyond hyperpairs, this is going to get very messy.

> At any rate, I think we can bring these up again when
> we actually start to implement this feature. First
> things first. :-)

Well, yes and no. While I am a big fan of iterative prototyping, I wonder if this might be a case where the architecture should drive the design and the design should drive the implementation. IOW, first think through what choices a system admin should be able to make (and, if given a mind-boggling array of choices, which ones he/she WOULD make).

Just my two cents...
Jeremy Fitzhardinge
2009-Apr-17 14:55 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
>> In any case, it's a bit like asking, "Why would I buy
>> a machine with two hyperthreads instead of two cores?"
>
> Yes. In a physical machine, the OS takes advantage of all
> resources available. So it doesn't matter if some of the
> "processors" are cores and some are hyperthreads. You
> are using ALL of the CPU resources you paid for.
>
> But in a virtualized environment, each VM gets a fraction
> of the resources, and if grabbing some fixed number of
> "processors" sometimes gets hyperthreads and sometimes
> gets cores, this will cause interesting issues for some
> workloads.
>
> Think about a cloud where one pays for resources used.
> You likely would demand to pay less for a hyperpair than
> for a non-vht pair.
>
> As a result, I think it will be a requirement that
> a system administrator be able to specify "I want two
> FULL cores" vs "I am willing to accept two hyperthreads".
> And once you get beyond hyperpairs, this is going to
> get very messy.

I think you're over-complicating it. At very worst, it will be no worse than the current situation, where Xen will place the vcpus on threads/cores in more or less arbitrary ways.

I think George's proposal can already accommodate the user needs you're talking about:

If the scheduler accounts for time spent executing on a contended HT thread (i.e., the threads are not paired, so the other thread could be idle or running any other code) at a lesser rate than a full core / uncontended thread, then the charging works out.

If the user has a requirement that domain X's vcpus must be running at full speed, then they can set their reservation to 100%. If we say that a contended HT thread is only worth, say, 70% of a "real" core, then that not only factors into the charging, but also means that any domain with a reservation > 70% is ineligible to run on a contended HT thread. (I think in practice this means that any domain with high reservations will end up running on gang-scheduled thread pairs, just to guarantee that the other thread is idle, so the uncontended HT thread can run at 100%.)

(Another way to look at it is that HT contention is a bit like having your vcpu preempted by Xen, but rather than going from 100% running to 0% running, your vcpu drops to 70%.)

    J
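A small sketch of the discounting described above, assuming (as in the message) that a contended hyperthread is worth roughly 70% of a full core; the constants and helper names are illustrative only, not part of any scheduler:

#include <stdbool.h>
#include <stdint.h>

/* A contended HT thread is assumed to deliver ~70% of a full core. */
#define HT_CONTENDED_PERMILLE 700

/* Credits to charge for 'ns' nanoseconds of run time on a given slot. */
static int64_t charge_for_runtime(int64_t ns, bool sibling_busy)
{
    return sibling_busy ? ns * HT_CONTENDED_PERMILLE / 1000 : ns;
}

/* Can this domain's reservation be met on a contended thread at all? */
static bool eligible_for_contended_thread(unsigned int reservation_permille)
{
    return reservation_permille <= HT_CONTENDED_PERMILLE;
}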
Dan Magenheimer
2009-Apr-17 15:55 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> I think you're over-complicating it.

Perhaps. Or maybe you are oversimplifying it? ;-)

> At very worst, it will be no worse
> than the current situation, where Xen will place the vcpus on
> threads/cores in more or less arbitrary ways.

Agreed. Treating threads as cores is bad. Since that's what's happening today, one would think that any fix is better than nothing.

> a contended HT thread is only worth, say, 70% of a "real" core
> :
> (Another way to look at it is that HT contention is a bit like having
> your vcpu preempted by Xen, but rather than going from 100%
> running to 0% running, your vcpu drops to 70%.)

And that's the oversimplification, I think. Just because Intel provides a rule-of-thumb that the extra thread increases performance by 30% doesn't mean that it is a good number to choose for scheduling purposes.

I suspect (and maybe this has even already been proven) that this varies from 0%-100% depending on the workload, and may even vary from *negative* to *more* than 100%. (Yes, I understand that i7 is supposed to be better than the last round of HT... but is it always better?)

Dan
Jeremy Fitzhardinge
2009-Apr-17 16:17 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Dan Magenheimer wrote:
> And that's the oversimplification, I think. Just
> because Intel provides a rule-of-thumb that the extra
> thread increases performance by 30% doesn't mean that
> it is a good number to choose for scheduling purposes.

Actually the 70% was a number I plucked out of the air with no justification at all.

> I suspect (and maybe this has even already been proven)
> that this varies from 0%-100% depending on the workload,
> and may even vary from *negative* to *more* than 100%.
> (Yes, I understand that i7 is supposed to be better than
> the last round of HT... but is it always better?)

The only way to know is by measurement, ideally with some specific performance counter which tells you what went on in that last timeslice. But if this is a big issue, you can always disable HT, as lots of people did the last time around.

    J
Dan Magenheimer
2009-Apr-17 16:46 UTC
RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
> But if this is a big issue, you can always disable HT, as
> lots of people did the last time around.

That would be a shame, because HT will almost certainly provide SOME performance benefit MOST of the time.

After pondering a bit, I guess I am arguing that once processors have HT, Turbo Boost, and power management, scheduling as a discipline has to move from the realm of the discrete to the realm of the continuous. A "second of CPU" no longer has any real meaning when the value of "a CPU" varies across time and workload. (I suppose due to shared cache effects and bus contention, this has probably always been the case, but to a less obvious degree.)

> The only way to know is by measurement, ideally with
> some specific performance counter which tells you
> what went on in that last timeslice.

Indeed. Even if it is impossible to predict the throughput of a specific workload on a specific CPU, it sure would be nice if we could at least roughly measure the past. Processor architects take note! ;-)
George Dunlap
2009-Apr-17 17:05 UTC
Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
Jeremy Fitzhardinge wrote:
> The only way to know is by measurement, ideally with some specific
> performance counter which tells you what went on in that last
> timeslice. But if this is a big issue, you can always disable HT, as
> lots of people did the last time around.

I think measurement, both of total system throughput and individual VM throughput, is the final word on all designs. I certainly plan on testing and comparing throughput for a variety of workloads as I develop the scheduler. And I encourage anyone with the time and inclination to try to find workloads for which the scheduler performs poorly as new features (such as the proposed HT scheduling) are introduced. :-)

Peace,
 -George