After the original announcement of plans to do some work on csched there wasn't much activity, so I'd like to ask about some observations that I made with the current implementation, and whether it would be expected that the planned changes would take care of them.

On a lightly loaded many-core, non-hyperthreaded system (e.g. a single CPU-bound process in one VM, and only some background load elsewhere), I see this CPU-bound vCPU permanently switch between sockets, which is a result of csched_cpu_pick() eagerly moving vCPU-s to "more idle" sockets. It would seem that some minimal latency consideration might be useful to add here, so that a very brief interruption by another vCPU doesn't result in unnecessary migration.

As a consequence of that eager moving, in the vast majority of cases the vCPU in question then (within a very short period of time) either triggers a cascade of other vCPU migrations, or begins a series of ping-pongs between (usually two) pCPU-s - until things settle again for a while. Again, some minimal latency added here might help avoid that.

Finally, in the complete inverse scenario of severely overcommitted systems (more than two fully loaded vCPU-s per pCPU) I frequently see Linux's softlockup watchdog kick in, now and then even resulting in the VM hanging. I had always thought that starvation of a vCPU for several seconds shouldn't be an issue that early - am I wrong here?

Jan
On Fri, Oct 9, 2009 at 3:53 PM, Jan Beulich <JBeulich@novell.com> wrote:
> After the original announcement of plans to do some work on csched there
> wasn't much activity, so I'd like to ask about some observations that I made
> with the current implementation, and whether it would be expected that
> the planned changes would take care of them.

There has been activity, but nothing worth sharing yet. :-) I'm working on the new "fairness" algorithm (perhaps called credits, perhaps not), which is a prerequisite for any further work such as load-balancing, power consumption, and so on. Unfortunately, I haven't been able to work on it for more than a week at a time for the last several months before being interrupted by other work-related tasks. :-(

Re the items you bring up below: I believe that my planned changes to load-balancing should address the first. First, I plan on making all cores which share an L2 cache share a runqueue. This will automatically share work among those cores without needing any special load-balancing to be done. Then, I plan on actually calculating:
* the per-runqueue load over the last time period;
* the amount each vcpu is contributing to that load.

Then load balancing won't be a matter of looking at the instantaneous runqueue lengths (as it is currently) but at the actual amount of "business" the runqueue has had over a period of time. Load balancing will be just that: actually moving vcpus around to make the loads more balanced. Balancing operations will happen at fixed intervals, rather than "whenever a runqueue is idle".

But those are just plans now; not a line of code has been written, and schedulers especially are notorious for the Law of Unexpected Consequences.

Re soft-lockups: that really shouldn't be possible with the current scheduler; if it happens, it's a bug. Have you pulled from xen-unstable recently? There was a bug introduced a few weeks ago that would cause problems; Keir checked in a fix for that one last week. Otherwise, if you're sure it's not a long-hypercall issue, there must be a bug somewhere. The new scheduler will be an almost complete re-write, so it will probably erase this bug and introduce its own. However, I doubt it will be ready by 3.5, so it's probably worth tracking down and fixing if we can.

Hope that answers your question. :-)

 -George

> On a lightly loaded many-core, non-hyperthreaded system (e.g. a single
> CPU-bound process in one VM, and only some background load elsewhere),
> I see this CPU-bound vCPU permanently switch between sockets, which is
> a result of csched_cpu_pick() eagerly moving vCPU-s to "more idle"
> sockets. It would seem that some minimal latency consideration might be
> useful to add here, so that a very brief interruption by another
> vCPU doesn't result in unnecessary migration.
>
> As a consequence of that eager moving, in the vast majority of cases
> the vCPU in question then (within a very short period of time) either
> triggers a cascade of other vCPU migrations, or begins a series of
> ping-pongs between (usually two) pCPU-s - until things settle again for
> a while. Again, some minimal latency added here might help avoid that.
>
> Finally, in the complete inverse scenario of severely overcommitted
> systems (more than two fully loaded vCPU-s per pCPU) I frequently
> see Linux's softlockup watchdog kick in, now and then even resulting
> in the VM hanging. I had always thought that starvation of a vCPU
> for several seconds shouldn't be an issue that early - am I wrong
> here?
>
> Jan
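To make the bookkeeping George describes concrete, here is a minimal sketch in plain C of windowed per-runqueue load accounting and per-vcpu contributions. All names and numbers are invented for illustration; this is not existing or planned Xen code.

#include <stdint.h>

#define LOAD_WINDOW_NS  (100ull * 1000 * 1000)  /* 100ms sampling window (arbitrary) */

/* Hypothetical accounting structures, not Xen's real ones. */
struct demo_vcpu {
    uint64_t ran_ns;      /* how long this vcpu ran in the current window */
};

struct demo_runq {
    uint64_t busy_ns;     /* total vcpu runtime accumulated this window */
    uint32_t nr_cpus;     /* pcpus sharing this runqueue (e.g. one L2 cache) */
    uint32_t load_pct;    /* busy_ns scaled to 0..100*nr_cpus at window close */
};

/* Charge the time a vcpu just ran to both the vcpu and its runqueue;
 * this would be called from the deschedule/accounting path. */
static void demo_account_run(struct demo_vcpu *v, struct demo_runq *rq,
                             uint64_t ran_ns)
{
    v->ran_ns   += ran_ns;
    rq->busy_ns += ran_ns;
}

/* Close a sampling window: convert accumulated runtime into a load figure
 * and reset.  A fixed-interval balancer would compare load_pct across
 * runqueues and move the vcpus with the largest ran_ns from the busiest
 * runqueue to the least busy one. */
static void demo_close_window(struct demo_runq *rq)
{
    rq->load_pct = (uint32_t)(rq->busy_ns * 100 /
                              (LOAD_WINDOW_NS * rq->nr_cpus));
    rq->busy_ns = 0;
}

The point of the windowed figure is exactly what the mail describes: a runqueue that happened to be momentarily empty no longer looks "idle" if it was busy for most of the window.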
> From: Jan Beulich
> Sent: October 9, 2009 22:54
>
> After the original announcement of plans to do some work on csched there
> wasn't much activity, so I'd like to ask about some observations that I made
> with the current implementation, and whether it would be expected that
> the planned changes would take care of them.
>
> On a lightly loaded many-core, non-hyperthreaded system (e.g. a single
> CPU-bound process in one VM, and only some background load elsewhere),
> I see this CPU-bound vCPU permanently switch between sockets, which is
> a result of csched_cpu_pick() eagerly moving vCPU-s to "more idle"
> sockets. It would seem that some minimal latency consideration might be
> useful to add here, so that a very brief interruption by another
> vCPU doesn't result in unnecessary migration.

There's a migration delay (default is 1ms) to judge cache hotness and thus avoid unnecessary migration. However, so far it's only checked when one cpu wants to steal vcpus from another runqueue. Possibly it makes sense to add this check to csched_vcpu_acct too, as a cold cache and a cascade of other VCPU migrations could easily beat the benefit of a "more idle" socket.

Thanks,
Kevin
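As a rough sketch of the check Kevin is referring to: treat a vcpu's cache as still "hot" if it last ran on its current pcpu less than vcpu_migration_delay microseconds ago, and skip the migration in that case. The names below are invented stand-ins for illustration, not the actual credit-scheduler code.

#include <stdint.h>

extern uint64_t now_ns(void);              /* stand-in for the hypervisor's current time */
extern unsigned int vcpu_migration_delay;  /* boot option, in microseconds */

struct demo_sched_vcpu {
    uint64_t last_run_ns;   /* when this vcpu last ran on its current pcpu */
};

/* Cache-hotness test: has the vcpu run here within the configured delay? */
static int demo_vcpu_is_cache_hot(const struct demo_sched_vcpu *v)
{
    return (now_ns() - v->last_run_ns) <
           (uint64_t)vcpu_migration_delay * 1000u;   /* us -> ns */
}

/* The suggestion in this thread: consult the same test before the periodic
 * accounting path picks a "more idle" socket, not only when stealing work. */
static int demo_migration_worthwhile(const struct demo_sched_vcpu *v,
                                     int target_is_idler)
{
    return target_is_idler && !demo_vcpu_is_cache_hot(v);
}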
>>> "Tian, Kevin" <kevin.tian@intel.com> 10.10.09 10:03 >>>
>> From: Jan Beulich
>> On a lightly loaded many-core, non-hyperthreaded system (e.g. a single
>> CPU-bound process in one VM, and only some background load elsewhere),
>> I see this CPU-bound vCPU permanently switch between sockets, which is
>> a result of csched_cpu_pick() eagerly moving vCPU-s to "more idle"
>> sockets. It would seem that some minimal latency consideration might be
>> useful to add here, so that a very brief interruption by another
>> vCPU doesn't result in unnecessary migration.
>
> There's a migration delay (default is 1ms) to judge cache hotness and
> thus avoid unnecessary migration. However, so far it's only checked
> when one cpu wants to steal vcpus from another runqueue. Possibly it
> makes sense to add this check to csched_vcpu_acct too, as a cold cache
> and a cascade of other VCPU migrations could easily beat the benefit
> of a "more idle" socket.

Where do you see this 1ms delay - I can't seem to spot it...

Jan
>>> George Dunlap <George.Dunlap@eu.citrix.com> 09.10.09 17:59 >>>
> Re soft-lockups: that really shouldn't be possible with the current
> scheduler; if it happens, it's a bug. Have you pulled from
> xen-unstable recently? There was a bug introduced a few weeks ago
> that would cause problems; Keir checked in a fix for that one last
> week. Otherwise, if you're sure it's not a long-hypercall issue,
> there must be a bug somewhere.

The testing that had exposed this was done in late July, on 3.3.1. I'll have to re-do this on up-to-date -unstable then, and post results if the issue does reproduce there.

Jan
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: October 12, 2009 15:28
>
> >>> "Tian, Kevin" <kevin.tian@intel.com> 10.10.09 10:03 >>>
> >> From: Jan Beulich
> >> On a lightly loaded many-core, non-hyperthreaded system (e.g. a single
> >> CPU-bound process in one VM, and only some background load elsewhere),
> >> I see this CPU-bound vCPU permanently switch between sockets, which is
> >> a result of csched_cpu_pick() eagerly moving vCPU-s to "more idle"
> >> sockets. It would seem that some minimal latency consideration might be
> >> useful to add here, so that a very brief interruption by another
> >> vCPU doesn't result in unnecessary migration.
> >
> > There's a migration delay (default is 1ms) to judge cache hotness and
> > thus avoid unnecessary migration. However, so far it's only checked
> > when one cpu wants to steal vcpus from another runqueue. Possibly it
> > makes sense to add this check to csched_vcpu_acct too, as a cold cache
> > and a cascade of other VCPU migrations could easily beat the benefit
> > of a "more idle" socket.
>
> Where do you see this 1ms delay - I can't seem to spot it...

Sorry, that 1ms is just the default value from my memory. Taking a look at the code, however, doesn't bear it out:

/*
 * Delay, in microseconds, between migrations of a VCPU between PCPUs.
 * This prevents rapid fluttering of a VCPU between CPUs, and reduces the
 * implicit overheads such as cache-warming. 1ms (1000) has been measured
 * as a good value.
 */
static unsigned int vcpu_migration_delay;
integer_param("vcpu_migration_delay", vcpu_migration_delay);

It's just the comment saying that; the option itself is never given a non-zero default. You may add that boot option and give it a try. :-)

Thanks,
Kevin
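For reference, integer_param() registers vcpu_migration_delay as a Xen boot-time option, so enabling the 1ms delay just means appending it to the hypervisor line in the bootloader configuration. The entry below is only an example (GRUB legacy syntax; file names and paths will differ per installation):

# In /boot/grub/menu.lst (example paths), on the Xen hypervisor line:
kernel /boot/xen.gz vcpu_migration_delay=1000 <other xen options>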
On 12/10/2009 08:27, "Jan Beulich" <JBeulich@novell.com> wrote:
>> There's a migration delay (default is 1ms) to judge cache hotness and
>> thus avoid unnecessary migration. However, so far it's only checked
>> when one cpu wants to steal vcpus from another runqueue. Possibly it
>> makes sense to add this check to csched_vcpu_acct too, as a cold cache
>> and a cascade of other VCPU migrations could easily beat the benefit
>> of a "more idle" socket.
>
> Where do you see this 1ms delay - I can't seem to spot it...

The option of interest is vcpu_migration_delay, but it defaults to zero (disabled). Intel explicitly set it to 1000 (1ms) for their tests.

 -- Keir
On Fri, 9 Oct 2009 16:59:25 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> There has been activity, but nothing worth sharing yet. :-) I'm
> working on the new "fairness" algorithm (perhaps called credits,
> perhaps not), which is a prerequisite for any further work such as
> load-balancing, power consumption, and so on. Unfortunately, I
> haven't been able to work on it for more than a week at a time for the
> last several months before being interrupted by other work-related
> tasks. :-(

Incidentally, I've been thinking of a scheduler plugin for databases and other similar apps. Preliminary DB benchmarks on Xen vs. bare metal are not as good. Of course, I am not blaming the scheduler for it. As with most big user apps, the DB is very multi-threaded, and as such it does a lot of tricks to get the OS scheduler to play things its way.

I may be getting my hands on a large box in a few weeks to bring up Xen and do some scalability work. Hope to find some low-hanging fruit.

thanks
Mukesh
On Sat, Oct 17, 2009 at 1:16 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Incidentally, I've been thinking of a scheduler plugin for databases and
> other similar apps. Preliminary DB benchmarks on Xen vs. bare metal are
> not as good. Of course, I am not blaming the scheduler for it. As with most
> big user apps, the DB is very multi-threaded, and as such it does a lot of
> tricks to get the OS scheduler to play things its way.
> I may be getting my hands on a large box in a few weeks to bring up Xen
> and do some scalability work. Hope to find some low-hanging fruit.

Scalability for Xen past 8 logical processors (where 1 hyperthread is 1 schedulable unit) is likely to be poor, due to the load-balancing algorithm.

Regarding a Xen scheduler plug-in for DB applications, it seems to me it would be best to understand the characteristics of the DB workload and how they respond to different kinds of contention. There may be a few surprises; for example, a workload that you assumed was CPU-bound may in fact be making many qemu-handled operations, so it's really blocking thousands of times per second. If we can make the default scheduler handle DB workloads well without making a special plug-in, that would be preferable.

Would you be willing, if you have the time, to help "beta-test" a new scheduler with a DB workload and compare it to the old one?

 -George
On Mon, 19 Oct 2009 10:34:16 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
....
> Scalability for Xen past 8 logical processors (where 1 hyperthread is
> 1 schedulable unit) is likely to be poor, due to the load-balancing
> algorithm.

Yeah, I've been thinking in the back of my mind of some sort of multiple runqueues, with vcpu priority adjustments based on cpu usage, but at the same time allowing certain vcpus to maintain a certain minimum level of usage. This is for cases where an app has pinned a high-priority thread to a vcpu. I've not looked at the existing Xen schedulers, so maybe they already do that. More runqueues means less locking but more load balancing, so maybe something that's tunable on the fly. Some thoughts at a high level....

> Regarding a Xen scheduler plug-in for DB applications, it seems to me
> it would be best to understand the characteristics of the DB workload
> and how they respond to different kinds of contention. There may be a
> few surprises; for example, a workload that you assumed was CPU-bound
> may in fact be making many qemu-handled operations, so it's really
> blocking thousands of times per second. If we can make the default
> scheduler handle DB workloads well without making a special plug-in,
> that would be preferable.

Agree. I'm hoping to collect all that information over the next couple/few months. The last attempt, made a year ago, didn't yield a whole lot of information because of problems with 32-bit tools and 64-bit guest app interaction.

In a nutshell, there's tremendous smarts in the DB, and so I think it prefers a simplified scheduler/OS that it can provide hints to and interact a little with. Ideally, it would like the ability for a privileged thread to tell the OS/hypervisor: I want to yield the cpu to thread #xyz.

Moreover, my focus is large systems, 32 to 128 logical processors, with 1/2 to 1 TB of memory. As such, I also want to address VCPUs being confined to a logical block of physical CPUs, taking into consideration that licenses are per physical cpu core. Also, it's important for a cluster heartbeat thread to get the cpu at expected times; otherwise it starts to freak out. Apparently we are seeing some of that during live migrations. Waiting on more info on that myself.

> Would you be willing, if you have the time, to help "beta-test" a new
> scheduler with a DB workload and compare it to the old one?

Yeah, sure. I hope to have a setup in a few weeks.

thanks,
Mukesh
On Tue, Oct 20, 2009 at 1:01 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
> Yeah, I've been thinking in the back of my mind of some sort of multiple
> runqueues

There already are multiple runqueues; the overhead comes from the "steal work" method of moving vcpus between them, which works fine for a low number of cpus but doesn't scale well.

Hmm, I thought I had written up my plans for load-balancing in an e-mail to the list, but I can't seem to find them now. Stand by for a description sometime. :-)

> Agree. I'm hoping to collect all that information over the next couple/few
> months. The last attempt, made a year ago, didn't yield a whole lot
> of information because of problems with 32-bit tools and 64-bit guest app
> interaction.

I have some good tools for collecting scheduling activity and analyzing it, using xentrace and xenalyze. When you get things set up, let me know and I'll post some information about using xentrace / xenalyze to characterize a workload's scheduling.

> In a nutshell, there's tremendous smarts in the DB, and so I think it
> prefers a simplified scheduler/OS that it can provide hints to and interact
> a little with. Ideally, it would like the ability for a privileged thread
> to tell the OS/hypervisor: I want to yield the cpu to thread #xyz.

If the thread is not scheduled on a vcpu by the OS, then when the DB says to yield to that thread, the OS can switch it onto the running vcpu - no changes needed.

The only potential modification would be if the DB wants to yield to a thread which is scheduled on another vcpu, but that vcpu is not currently running. Then the guest OS *may* want to be able to ask the HV to yield the currently running vcpu to the other vcpu. That interface is worth thinking about.

> Moreover, my focus is large systems, 32 to 128 logical processors, with 1/2
> to 1 TB of memory. As such, I also want to address VCPUs being confined
> to a logical block of physical CPUs, taking into consideration that
> licenses are per physical cpu core.

This sounds like it would benefit from the "CPU pools" patch submitted by Juergen Gross.

 -George
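Purely as a strawman for the "ask the HV to yield to another vcpu" interface mentioned above: a guest-visible directed yield could be modelled on the existing SCHEDOP_* hypercalls. Everything below - the sub-op number, the argument structure, and the helper - is invented for illustration and is not part of the real Xen ABI.

#include <stdint.h>

/* Hypothetical sub-op and argument; NOT an existing Xen interface. */
#define SCHEDOP_yield_to   8     /* invented sub-op number */

struct sched_yield_to {
    uint32_t vcpu_id;            /* target vcpu of the calling domain */
};

/* Stand-in declaration; a guest's own hypercall header would provide the
 * real HYPERVISOR_sched_op used for SCHEDOP_yield, SCHEDOP_block, etc. */
extern int HYPERVISOR_sched_op(int cmd, void *arg);

/* Guest-side helper: give up the rest of this vcpu's timeslice in favour of
 * the named sibling vcpu, if the hypervisor can run it right away. */
static inline int yield_to_vcpu(uint32_t vcpu_id)
{
    struct sched_yield_to arg = { .vcpu_id = vcpu_id };
    return HYPERVISOR_sched_op(SCHEDOP_yield_to, &arg);
}

The interesting policy question is on the hypervisor side: whether the target vcpu should merely be boosted in its own runqueue or pulled onto the yielding pcpu, and how to keep such a call from becoming a fairness loophole.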
George Dunlap wrote:
> On Tue, Oct 20, 2009 at 1:01 AM, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:
>> Moreover, my focus is large systems, 32 to 128 logical processors, with 1/2
>> to 1 TB of memory. As such, I also want to address VCPUs being confined
>> to a logical block of physical CPUs, taking into consideration that
>> licenses are per physical cpu core.
>
> This sounds like it would benefit from the "CPU pools" patch submitted
> by Juergen Gross.

Indeed. Same problem, so the same solution should be fine :-)

Mukesh, if you need more info about cpu pools, I would be glad to help.

Juergen

--
Juergen Gross                     Principal Developer Operating Systems
TSP ES&S SWE OS6                  Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions      e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                  Internet: ts.fujitsu.com
D-81739 Muenchen                  Company details: ts.fujitsu.com/imprint.html
On Tue, 20 Oct 2009 10:37:19 +0100 George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Oct 20, 2009 at 1:01 AM, Mukesh Rathor
> <mukesh.rathor@oracle.com> wrote:
>> Yeah, I've been thinking in the back of my mind of some sort of
>> multiple runqueues
>
> There already are multiple runqueues; the overhead comes from the
> "steal work" method of moving vcpus between them, which works fine for
> a low number of cpus but doesn't scale well.

Exactly - we'd have to study it to find the contention points and address those. Hopefully I can take what you've got, tinker around a bit, and send the changes to see what you think.

> Hmm, I thought I had written up my plans for load-balancing in an
> e-mail to the list, but I can't seem to find them now. Stand by
> for a description sometime. :-)

Actually, I think you posted it on the list and I saved it somewhere; I plan on reading it and figuring it out once I get closer to doing the work.

>> Agree. I'm hoping to collect all that information over the next
>> couple/few months. The last attempt, made a year ago, didn't yield
>> a whole lot of information because of problems with 32-bit tools
>> and 64-bit guest app interaction.
>
> I have some good tools for collecting scheduling activity and
> analyzing it, using xentrace and xenalyze. When you get things set up,
> let me know and I'll post some information about using xentrace /
> xenalyze to characterize a workload's scheduling.

Great, thanks.

>> In a nutshell, there's tremendous smarts in the DB, and so I think
>> it prefers a simplified scheduler/OS that it can provide hints to
>> and interact a little with. Ideally, it would like the ability for a
>> privileged thread to tell the OS/hypervisor: I want to yield the cpu
>> to thread #xyz.
>
> If the thread is not scheduled on a vcpu by the OS, then when the DB
> says to yield to that thread, the OS can switch it onto the running
> vcpu - no changes needed.
>
> The only potential modification would be if the DB wants to yield to a
> thread which is scheduled on another vcpu, but that vcpu is not
> currently running. Then the guest OS *may* want to be able to ask the
> HV to yield the currently running vcpu to the other vcpu. That
> interface is worth thinking about.

Yup, precisely.

>> Moreover, my focus is large systems, 32 to 128 logical processors, with
>> 1/2 to 1 TB of memory. As such, I also want to address VCPUs being
>> confined to a logical block of physical CPUs, taking into consideration
>> that licenses are per physical cpu core.
>
> This sounds like it would benefit from the "CPU pools" patch submitted
> by Juergen Gross.

Yes, I saw that on the list also, and when I get closer to doing the work I will take a closer look. Right now I am still trying to round up the hardware; then I will have to round up folks familiar with the benchmarks to set them up. Then the easier part begins :)...

thanks
Mukesh
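Since the tools quoted above come up again here, the kind of capture/analysis session being referred to looks roughly like the following. This is written from memory; option names and the scheduler event mask should be double-checked against the xentrace man page and the xenalyze README for the tree in use.

# In dom0, capture scheduler-class trace records (the TRC_SCHED class is
# 0x0002f000) while the DB workload runs, then stop xentrace with Ctrl-C:
xentrace -e 0x0002f000 /tmp/sched.trace

# Post-process the binary trace; xenalyze's summary mode reports per-domain,
# per-vcpu runtime and scheduling statistics:
xenalyze -s /tmp/sched.trace

Correlating that summary with the DB's own throughput numbers is usually enough to tell whether vcpus are being starved, migrating excessively, or blocking far more often than expected.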