Hi guys,

Check out the attached patches. I changed the spin lock semantics so the
lock contains the vcpu id of the vcpu holding it. This then tells xen to
make that vcpu runnable if not already running:

Linux:
    spin_lock()
        if (try_lock() == failed)
            loop X times
            if (try_lock() == failed)
                sched_op_yield_to(vcpu_num of holder)
                start again;
            endif
        endif

Xen:
    sched_op_yield_to:
        if (vcpu_running(vcpu_num arg))
            do nothing
        else
            vcpu_kick(vcpu_num arg)
            do_yield()
        endif

In my worst-case test scenario, I get about a 20-36% improvement when the
system is two to three times over-provisioned.

Please provide any feedback. I would like to submit an official patch for
SCHEDOP_yield_to in xen.

thanks,
Mukesh
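For illustration, a minimal C sketch of the guest-side slow path described
above might look as follows. This is not the attached patch: the lock
layout, this_vcpu() and the sched_yield_to() hypercall wrapper are assumed
names invented for the sketch.

    /* Sketch only, not the attached patch.  The lock word holds
     * (holder vcpu id + 1), with 0 meaning unlocked.  this_vcpu() and
     * sched_yield_to() are hypothetical wrappers for the caller's vcpu id
     * and the proposed SCHEDOP_yield_to hypercall. */

    #define SPIN_RETRIES 1000

    struct yieldto_lock {
        int holder_plus_one;                 /* 0 == unlocked */
    };

    extern int  this_vcpu(void);
    extern void sched_yield_to(int vcpu);

    static int try_lock(struct yieldto_lock *lk, int me)
    {
        int expected = 0;
        /* Claim the lock by storing our vcpu id (+1). */
        return __atomic_compare_exchange_n(&lk->holder_plus_one, &expected,
                                           me + 1, 0, __ATOMIC_ACQUIRE,
                                           __ATOMIC_RELAXED);
    }

    static void yieldto_spin_lock(struct yieldto_lock *lk)
    {
        int me = this_vcpu();

        for (;;) {
            int i, holder;

            for (i = 0; i < SPIN_RETRIES; i++)
                if (try_lock(lk, me))
                    return;

            /* Still contended: hint Xen to run the holder, then start again. */
            holder = __atomic_load_n(&lk->holder_plus_one, __ATOMIC_RELAXED);
            if (holder != 0)
                sched_yield_to(holder - 1);
        }
    }

    static void yieldto_spin_unlock(struct yieldto_lock *lk)
    {
        __atomic_store_n(&lk->holder_plus_one, 0, __ATOMIC_RELEASE);
    }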
How does this compare with Jeremy's existing paravirtualised spinlocks in
pv_ops? They required no hypervisor changes. Cc'ing Jeremy.

 -- Keir

On 17/08/2010 02:33, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:

> Check out the attached patches. I changed the spin lock semantics so the
> lock contains the vcpu id of the vcpu holding it. This then tells xen to
> make that vcpu runnable if not already running.
>
> In my worst-case test scenario, I get about a 20-36% improvement when the
> system is two to three times over-provisioned.
>
> Please provide any feedback. I would like to submit an official patch
> for SCHEDOP_yield_to in xen.
>>> On 17.08.10 at 03:33, Mukesh Rathor <mukesh.rathor@oracle.com> wrote:

A mere vcpu_kick()+do_yield() seems pretty simplistic to me - if the
current vCPU still has higher priority than the one kicked, you'll achieve
nothing. Instead, I think you really want to offer the current vCPU's time
slice to the target, making sure the target yields back as soon as it has
released the lock (thus transferring the borrowed time slice back to where
it belongs).

And then, without using ticket locks, you likely increase unfairness (as
any other actively running vCPU going for the same lock will have much
better chances of acquiring it than the vCPU that originally tried to and
yielded), including the risk of starvation.

Still, I'm glad to see we're not the only ones wanting a directed yield
capability in Xen.

> +struct sched_yield_to {
> +    unsigned int version;
> +    unsigned int vcpu_id;
> +};

Why do you need a version field here, the more so as it doesn't appear to
get read by the hypervisor?

Jan
>>> On 8/16/2010 at 9:33 PM, in message
<20100816183357.08623c4c@mantra.us.oracle.com>, Mukesh Rathor
<mukesh.rathor@oracle.com> wrote:

> Check out the attached patches. I changed the spin lock semantics so the
> lock contains the vcpu id of the vcpu holding it. This then tells xen to
> make that vcpu runnable if not already running.
>
> Please provide any feedback. I would like to submit an official patch
> for SCHEDOP_yield_to in xen.

While I agree that a directed yield is a useful construct, I am not sure
how this protocol would deal with ticket spin locks, as you would want to
implement some form of priority inheritance - if the vcpu you are yielding
to is currently blocked on another (ticket) spin lock, you would want to
yield to the owner of that other spin lock. Clearly, this dependency
information is only available in the guest, and that is where we would
need to implement this logic. I think Jan's "enlightened" spin locks
implemented this kind of logic.

Perhaps another way to deal with this generic problem of inopportune guest
preemption might be to coordinate guest preemption - allow the guest to
notify the hypervisor that it is in a critical section. If the no-preempt
guest state is set, the hypervisor can choose to defer the preemption by
giving the guest vcpu in question an additional time quantum to run. In
this case, the hypervisor would post the fact that a preemption is pending
on the guest, and the guest vcpu is expected to relinquish control to the
hypervisor as part of exiting the critical section. Since guest preemption
is not a "correctness" issue, the hypervisor can choose not to honor the
"no-preempt" state the guest may post if the hypervisor detects that the
guest is buggy (or malicious).

Much of what we have been discussing with "enlightened" spin locks is how
to recover from the situation that results when we have an inopportune
guest preemption. The coordinated preemption protocol described here
attempts to avoid getting into pathological situations. If I recall
correctly, I think there were some patches for doing this form of
preemption management.

Regards,

K. Y
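For illustration, the guest side of the coordinated-preemption protocol
described above could look roughly like the sketch below. It is not an
existing Xen interface: the shared preempt_ctrl layout and the
hypercall_yield() wrapper are assumptions made for the sketch.

    /* Sketch of the "no-preempt hint" protocol, not an existing Xen
     * interface: the per-vcpu shared structure and hypercall_yield()
     * are invented here. */

    struct preempt_ctrl {            /* one per vcpu, shared with Xen */
        int no_preempt;              /* guest: "I'm in a critical section" */
        int preempt_pending;         /* Xen: "yield once you're done"      */
    };

    extern void hypercall_yield(void);   /* hypothetical SCHEDOP_yield wrapper */

    static void guest_critical_enter(struct preempt_ctrl *pc)
    {
        pc->no_preempt = 1;          /* ask Xen to defer preemption */
        __sync_synchronize();        /* flag visible before the critical work */
    }

    static void guest_critical_exit(struct preempt_ctrl *pc)
    {
        __sync_synchronize();
        pc->no_preempt = 0;

        /* Honour a deferred preemption as soon as the lock is dropped. */
        if (pc->preempt_pending) {
            pc->preempt_pending = 0;
            hypercall_yield();
        }
    }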
Jeremy Fitzhardinge
2010-Aug-17 17:43 UTC
Re: [Xen-devel] Linux spin lock enhancement on xen
On 08/16/2010 06:33 PM, Mukesh Rathor wrote:
> In my worst-case test scenario, I get about a 20-36% improvement when the
> system is two to three times over-provisioned.
>
> Please provide any feedback. I would like to submit an official patch
> for SCHEDOP_yield_to in xen.

This approach only works for old-style spinlocks. Ticketlocks also have
the problem of making sure the next vcpu gets scheduled on unlock.

Have you looked at the pv spinlocks I have upstream in the pvops kernels,
which use the (existing) poll hypercall to block the waiting vcpu until
the lock is free?

    J
On Tue, 17 Aug 2010 10:43:04 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This approach only works for old-style spinlocks. Ticketlocks also have
> the problem of making sure the next vcpu gets scheduled on unlock.

Well, unfortunately, it looks like old-style spinlocks are going to be
around for a very long time. I've heard there are customers still on EL3!

> Have you looked at the pv spinlocks I have upstream in the pvops kernels,
> which use the (existing) poll hypercall to block the waiting vcpu until
> the lock is free?
>     J

> How does this compare with Jeremy's existing paravirtualised spinlocks
> in pv_ops? They required no hypervisor changes. Cc'ing Jeremy.
>  -- Keir

Yeah, I looked at it today. What pv-ops is doing is forcing a yield via a
fake irq/event channel poll, after storing the lock pointer in a per-cpu
area. The unlocker then IPIs the waiting vcpus. The lock holder may not be
running though, and there is no hint to the hypervisor to run it. So you
may have many waiters come and leave for no reason.

To me this is more overhead than needed in a guest. In my approach, the
hypervisor is hinted exactly which vcpu is the lock holder. Often many
VCPUs are pinned to a set of physical cpus due to licensing and other
reasons. So this really helps a vcpu that is holding a spin lock, wanting
to do some possibly real-time work, get scheduled and move on. Moreover,
the number of vcpus is going up pretty fast.

Thanks,
Mukesh
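For reference, the pv-ops slow path being discussed works roughly as
sketched below: the waiter records which lock it is spinning on in a
per-cpu slot and blocks in the poll hypercall on a dedicated event
channel, and the unlocker kicks any vcpu recorded as waiting. This is a
simplified sketch, not the upstream code - the spin count, bookkeeping and
helper names are illustrative.

    /* Simplified sketch of the pv-ops spinlock slow path, not the upstream
     * code.  xen_poll_port() stands in for SCHEDOP_poll on the cpu's
     * dedicated "lock kicker" event channel, send_lock_kick() for
     * notifying that channel; both are assumed helpers. */

    #define NR_VCPUS    64
    #define SPIN_COUNT  1000

    struct pv_spinlock { unsigned char lock; };        /* 1 == taken */

    static struct pv_spinlock *spinning_on[NR_VCPUS];  /* per-cpu, really */

    extern void xen_poll_port(int cpu);
    extern void send_lock_kick(int cpu);

    static void pv_spin_lock(struct pv_spinlock *lk, int cpu)
    {
        for (;;) {
            int i;

            for (i = 0; i < SPIN_COUNT; i++)
                if (__sync_lock_test_and_set(&lk->lock, 1) == 0)
                    return;                            /* got it */

            /* Record what we are waiting for, recheck once, then block in
             * Xen until the unlocker kicks our event channel. */
            spinning_on[cpu] = lk;
            if (__sync_lock_test_and_set(&lk->lock, 1) == 0) {
                spinning_on[cpu] = NULL;
                return;
            }
            xen_poll_port(cpu);
            spinning_on[cpu] = NULL;
        }
    }

    static void pv_spin_unlock(struct pv_spinlock *lk, int ncpus)
    {
        int cpu;

        __sync_lock_release(&lk->lock);

        /* Kick any vcpu recorded as waiting for this lock. */
        for (cpu = 0; cpu < ncpus; cpu++)
            if (spinning_on[cpu] == lk)
                send_lock_kick(cpu);
    }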
On Tue, 17 Aug 2010 08:53:32 +0100
"Jan Beulich" <JBeulich@novell.com> wrote:

> A mere vcpu_kick()+do_yield() seems pretty simplistic to me - if the
> current vCPU still has higher priority than the one kicked, you'll
> achieve nothing. Instead, I think you really want to offer the current
> vCPU's time slice to the target, making sure the target yields back as
> soon as it has released the lock (thus transferring the borrowed time
> slice back to where it belongs).

True, that is a phase II enhancement.

> And then, without using ticket locks, you likely increase unfairness (as
> any other actively running vCPU going for the same lock will have much
> better chances of acquiring it than the vCPU that originally tried to
> and yielded), including the risk of starvation.

Please see the other thread for my thoughts on this.

> Still, I'm glad to see we're not the only ones wanting a directed yield
> capability in Xen.
>
> > +struct sched_yield_to {
> > +    unsigned int version;
> > +    unsigned int vcpu_id;
> > +};
>
> Why do you need a version field here, the more so as it doesn't appear
> to get read by the hypervisor?

No reason, I just forgot to remove it.

thanks,
Mukesh
On Tue, 17 Aug 2010 08:34:49 -0600
"Ky Srinivasan" <ksrinivasan@novell.com> wrote:
..
> While I agree that a directed yield is a useful construct, I am not sure
> how this protocol would deal with ticket spin locks, as you would want
> to implement some form of priority inheritance - if the vcpu you are
> yielding to is currently blocked on another (ticket) spin lock, you
> would want to yield to the owner of that other spin lock. Clearly, this
> dependency information is only available in the guest, and that is where
> we would need to implement this logic. I think Jan's "enlightened" spin
> locks implemented this kind of logic.

Frankly, I'm opposed to ticket spin locks. IMO, starvation and fairness
are scheduler problems, not spin lock problems. If a vcpu has higher
priority, it is for a reason, and I'd like it to get prioritized. Imagine
a cluster stack in a 128-vcpu environment: the thread doing the heartbeat
definitely needs the priority it deserves.

Having said that, my proposal can be enhanced to take ticket spin locks
into consideration by having unlock make sure the next vcpu in line gets a
temporary priority boost.

thanks,
Mukesh
Jeremy Fitzhardinge
2010-Aug-18 16:37 UTC
Re: [Xen-devel] Linux spin lock enhancement on xen
On 08/17/2010 06:58 PM, Mukesh Rathor wrote:
>> How does this compare with Jeremy's existing paravirtualised spinlocks
>> in pv_ops? They required no hypervisor changes. Cc'ing Jeremy.
>>  -- Keir
>
> Yeah, I looked at it today. What pv-ops is doing is forcing a yield via a
> fake irq/event channel poll, after storing the lock pointer in a per-cpu
> area. The unlocker then IPIs the waiting vcpus. The lock holder may not
> be running though, and there is no hint to the hypervisor to run it. So
> you may have many waiters come and leave for no reason.

(They don't leave for no reason; they leave when they're told they can
take the lock next.)

I don't see why the guest should micromanage Xen's scheduler decisions.
If a VCPU is waiting for another VCPU and can put itself to sleep in the
meantime, then it's up to Xen to take advantage of that newly freed PCPU
to schedule something. It may decide to run something in your domain
that's runnable, or it may decide to run something else. There's no reason
why the spinlock holder is the best VCPU to run overall, or even the best
VCPU in your domain.

My view is you should just put any VCPU which has nothing to do to sleep,
and let Xen sort out the scheduling of the remainder.

> To me this is more overhead than needed in a guest. In my approach, the
> hypervisor is hinted exactly which vcpu is the lock holder.

The slow path should be rare. In general locks should be taken
uncontended, or with brief contention. Locks should be held for a short
period of time, so the risk of being preempted while holding the lock
should be low. The effects of the preemption are pretty disastrous, so we
need to handle it, but the slow path will be rare, so the time spent
handling it is not a critical factor (and can be compensated for by tuning
the timeout before dropping into the slow path).

> Often many VCPUs are pinned to a set of physical cpus due to licensing
> and other reasons. So this really helps a vcpu that is holding a spin
> lock, wanting to do some possibly real-time work, get scheduled and move
> on.

I'm not sure I understand this point. If you're pinning vcpus to pcpus,
then presumably you're not going to share a pcpu among many, or any, vcpus,
so the lock holder will be able to run any time it wants. And a directed
yield will only help if the lock waiter is sharing the same pcpu as the
lock holder, so it can hand over its timeslice (since making the directed
yield preempt something already running in order to run your target vcpu
seems rude and ripe for abuse).

> Moreover, the number of vcpus is going up pretty fast.

Presumably the number of pcpus is also going up, so the amount of per-pcpu
overcommit is about the same.

    J
On 18/08/2010 17:37, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> I don't see why the guest should micromanage Xen's scheduler decisions.
> If a VCPU is waiting for another VCPU and can put itself to sleep in the
> meantime, then it's up to Xen to take advantage of that newly freed PCPU
> to schedule something. It may decide to run something in your domain
> that's runnable, or it may decide to run something else. There's no
> reason why the spinlock holder is the best VCPU to run overall, or even
> the best VCPU in your domain.
>
> My view is you should just put any VCPU which has nothing to do to
> sleep, and let Xen sort out the scheduling of the remainder.

Yeah, I'm no fan of yield or yield-to type operations. I'd reserve the
right to implement both of them as no-ops.

 -- Keir
On Wed, 18 Aug 2010 18:09:22 +0100
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 18/08/2010 17:37, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
> > My view is you should just put any VCPU which has nothing to do to
> > sleep, and let Xen sort out the scheduling of the remainder.
>
> Yeah, I'm no fan of yield or yield-to type operations. I'd reserve the
> right to implement both of them as no-ops.
>
> -- Keir

I think making them advisory makes sense. Ultimately xen decides.

thanks,
Mukesh
On Wed, 18 Aug 2010 09:37:17 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> (They don't leave for no reason; they leave when they're told they can
> take the lock next.)
>
> I don't see why the guest should micromanage Xen's scheduler decisions.
> If a VCPU is waiting for another VCPU and can put itself to sleep in the
> meantime, then it's up to Xen to take advantage of that newly freed PCPU
> to schedule something. It may decide to run something in your domain
> that's runnable, or it may decide to run something else. There's no
> reason why the spinlock holder is the best VCPU to run overall, or even
> the best VCPU in your domain.
>
> My view is you should just put any VCPU which has nothing to do to
> sleep, and let Xen sort out the scheduling of the remainder.

Agree for the most part. But if we can spare the cost of a vcpu coming
onto a cpu, realizing it has nothing to do and putting itself back to
sleep, by a simple solution, we've just saved cycles. Often we are looking
for tiny gains in the benchmarks against the competition.

Yes, we don't want to micromanage xen's scheduler. But if a guest knows
something that the scheduler does not, and has no way of knowing, then it
would be nice to be able to exploit that. I didn't think a vcpu telling
xen that it's not making forward progress was intrusive.

Another approach, perhaps better, is a hypercall that allows a vcpu's
priority to be temporarily boosted. What do you guys think about that?
This would be akin to a system call allowing a process to boost its
priority, or to some kernels where a thread holding a lock gets a
temporary bump in priority because a waiter tells the kernel to.

> I'm not sure I understand this point. If you're pinning vcpus to pcpus,
> then presumably you're not going to share a pcpu among many, or any,
> vcpus, so the lock holder will be able to run any time it wants. And a
> directed yield will only help if the lock waiter is sharing the same
> pcpu as the lock holder, so it can hand over its timeslice (since making
> the directed yield preempt something already running in order to run
> your target vcpu seems rude and ripe for abuse).

No. If a customer licenses 4 cpus and runs a guest with 12 vcpus, you now
have 12 vcpus confined to the 4 physical cpus.

> Presumably the number of pcpus is also going up, so the amount of
> per-pcpu overcommit is about the same.

Unless the vcpus are going up faster than pcpus :)....

Thanks,
Mukesh
Jeremy Fitzhardinge
2010-Aug-23 21:33 UTC
Re: [Xen-devel] Linux spin lock enhancement on xen
On 08/18/2010 07:52 PM, Mukesh Rathor wrote:
>> My view is you should just put any VCPU which has nothing to do to
>> sleep, and let Xen sort out the scheduling of the remainder.
>
> Agree for the most part. But if we can spare the cost of a vcpu coming
> onto a cpu, realizing it has nothing to do and putting itself back to
> sleep, by a simple solution, we've just saved cycles. Often we are
> looking for tiny gains in the benchmarks against the competition.

Well, how does your proposal compare to mine? Is it more efficient?

> Yes, we don't want to micromanage xen's scheduler. But if a guest knows
> something that the scheduler does not, and has no way of knowing, then
> it would be nice to be able to exploit that. I didn't think a vcpu
> telling xen that it's not making forward progress was intrusive.

Well, blocking on an event channel is a good hint. And what's more, it
allows the guest even more control because it can choose which vcpu to
wake up when.

> Another approach, perhaps better, is a hypercall that allows a vcpu's
> priority to be temporarily boosted. What do you guys think about that?
> This would be akin to a system call allowing a process to boost its
> priority, or to some kernels where a thread holding a lock gets a
> temporary bump in priority because a waiter tells the kernel to.

That kind of thing has many pitfalls - not least, how do you make sure it
doesn't get abused? A "proper" mechanism to deal with this would be to
expose some kind of complete vcpu blocking dependency graph to Xen to
inform its scheduling decisions, but that's probably overkill...

>> I'm not sure I understand this point. If you're pinning vcpus to pcpus,
>> then presumably you're not going to share a pcpu among many, or any,
>> vcpus, so the lock holder will be able to run any time it wants. And a
>> directed yield will only help if the lock waiter is sharing the same
>> pcpu as the lock holder, so it can hand over its timeslice (since
>> making the directed yield preempt something already running in order
>> to run your target vcpu seems rude and ripe for abuse).
>
> No. If a customer licenses 4 cpus and runs a guest with 12 vcpus, you
> now have 12 vcpus confined to the 4 physical cpus.

In one domain? Why would they do that?

    J
Wow, I totally missed this thread. A couple of thoughts:

Complicated solutions for the scheduler are a really bad idea. It's hard
enough to predict and debug the side-effects of simple mechanisms; a
complex mechanism is doomed to failure at the outset.

I agree with Jeremy that the guest shouldn't tell Xen to run a specific
VCPU. At most it should be something along the lines of, "If you're going
to run any vcpu from this domain, please run vcpu X."

Jeremy, do you think that changes to the HV are necessary, or do you think
that the existing solution is sufficient? It seems to me like hinting to
the HV to do a directed yield makes more sense than making the same thing
happen via blocking and event channels. OTOH, that gives the guest a lot
more control over when and how things happen.

Mukesh, did you see the patch by Xiantao Zhang a few days ago, regarding
what to do on an HVM pause instruction? I thought the solution he had was
interesting: when yielding due to a spinlock, rather than going to the
back of the queue, just go behind one person. I think an implementation of
"yield_to" that might make sense in the credit scheduler is:
 * Put the yielding vcpu behind one other vcpu.
 * If the yield-to vcpu is not running, pull it to the front within its
   priority. (I.e., if it's UNDER, put it at the front so it runs next;
   if it's OVER, make it the first OVER vcpu.)

Thoughts?
 -George

On Wed, Aug 18, 2010 at 6:09 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 18/08/2010 17:37, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>> My view is you should just put any VCPU which has nothing to do to
>> sleep, and let Xen sort out the scheduling of the remainder.
>
> Yeah, I'm no fan of yield or yield-to type operations. I'd reserve the
> right to implement both of them as no-ops.
>
> -- Keir
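As a rough illustration of the rule George describes, a sketch of the two
queue operations might look like the following. This is not actual credit
scheduler code: the runqueue and vcpu types, the list handling and the
helper names are all invented for the sketch, and both vcpus are assumed
to have already been removed from the queue before re-insertion.

    /* Rough sketch of the proposed yield_to rule, not the Xen credit
     * scheduler: types and helpers here are illustrative only. */

    enum prio { PRIO_UNDER, PRIO_OVER };

    struct sk_vcpu {
        struct sk_vcpu *next;
        enum prio       prio;
        int             running;
    };

    struct sk_runq {
        struct sk_vcpu *head;     /* singly linked run queue, runs first */
    };

    /* Yielder: re-insert behind (at least) one other runnable vcpu,
     * regardless of its own priority. */
    static void yield_requeue(struct sk_runq *rq, struct sk_vcpu *v)
    {
        if (rq->head == NULL) {   /* nothing else runnable: keep our place */
            v->next = NULL;
            rq->head = v;
            return;
        }
        v->next = rq->head->next; /* slot in behind the first entry */
        rq->head->next = v;
    }

    /* Yield-to target: if it is not running, pull it to the front of its
     * priority band -- first overall if UNDER, first OVER entry otherwise. */
    static void boost_target(struct sk_runq *rq, struct sk_vcpu *t)
    {
        struct sk_vcpu **pp = &rq->head;

        if (t->running)
            return;

        if (t->prio == PRIO_OVER)      /* skip past the UNDER entries */
            while (*pp != NULL && (*pp)->prio == PRIO_UNDER)
                pp = &(*pp)->next;

        t->next = *pp;
        *pp = t;
    }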
On 24/08/2010 09:08, "George Dunlap" <dunlapg@umich.edu> wrote:

> Jeremy, do you think that changes to the HV are necessary, or do you
> think that the existing solution is sufficient? It seems to me like
> hinting to the HV to do a directed yield makes more sense than making
> the same thing happen via blocking and event channels. OTOH, that gives
> the guest a lot more control over when and how things happen.
>
> Mukesh, did you see the patch by Xiantao Zhang a few days ago, regarding
> what to do on an HVM pause instruction?

I think there's a difference between providing some kind of yield_to as a
private interface within the hypervisor, as some kind of heuristic for
emulating something like PAUSE, versus providing such an operation as a
public guest interface.

It seems to me that Jeremy's spinlock implementation provides all the info
a scheduler would require: vcpus trying to acquire a lock are blocked, and
the lock holder wakes just the next vcpu in turn when it releases the
lock. The scheduler at that point may have a decision to make as to
whether to run the lock releaser, or the new lock holder, or both, but how
can the guest help with that when it's a system-wide scheduling decision?
Obviously the guest would presumably like all its runnable vcpus to run
all of the time!

 - Keir

> I thought the solution he had was interesting: when yielding due to a
> spinlock, rather than going to the back of the queue, just go behind one
> person. I think an implementation of "yield_to" that might make sense in
> the credit scheduler is:
>  * Put the yielding vcpu behind one other vcpu.
>  * If the yield-to vcpu is not running, pull it to the front within its
>    priority. (I.e., if it's UNDER, put it at the front so it runs next;
>    if it's OVER, make it the first OVER vcpu.)
On Tue, Aug 24, 2010 at 9:20 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> I think there's a difference between providing some kind of yield_to as
> a private interface within the hypervisor, as some kind of heuristic for
> emulating something like PAUSE, versus providing such an operation as a
> public guest interface.

I agree that any "yield_to" should be strictly a hint, and not a guarantee
by the HV. If that's the case, I don't actually see a difference between a
malicious guest knowing that "yield_to" happens to behave a certain way,
and a malicious guest knowing that "PAUSE" behaves a certain way.

> It seems to me that Jeremy's spinlock implementation provides all the
> info a scheduler would require: vcpus trying to acquire a lock are
> blocked, and the lock holder wakes just the next vcpu in turn when it
> releases the lock. The scheduler at that point may have a decision to
> make as to whether to run the lock releaser, or the new lock holder, or
> both, but how can the guest help with that when it's a system-wide
> scheduling decision? Obviously the guest would presumably like all its
> runnable vcpus to run all of the time!

I think that makes sense, but it leaves out one important factor: the
credit scheduler, as it is, is essentially round-robin within a priority,
and round-robin schedulers are known to discriminate against vcpus that
yield in favor of those that burn up their whole timeslice. I think it
makes sense to give yielding guests a bit of an advantage to compensate
for that.

That said, this whole thing needs measurement: any yield_to implementation
would need to show that:
 * The performance is significantly better than either Jeremy's patches or
   a simple yield (with, perhaps, boost-peers, as Xiantao suggests).
 * It does not give a spin-locking workload a cpu advantage over other
   workloads, such as specjbb (cpu-bound) or scp (very latency-sensitive).

 -George
>>> On 24.08.10 at 10:20, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> It seems to me that Jeremy's spinlock implementation provides all the
> info a scheduler would require: vcpus trying to acquire a lock are
> blocked, and the lock holder wakes just the next vcpu in turn when it
> releases the lock. The scheduler at that point may have a decision to
> make as to whether to run the lock releaser, or the new lock holder, or
> both, but how can the guest help with that when it's a system-wide
> scheduling decision? Obviously the guest would presumably like all its
> runnable vcpus to run all of the time!

Blocking on an unavailable lock is somewhat different imo: if the blocked
vCPU didn't exhaust its time slice, I think it is very valid for it to
expect not to penalize the whole VM, and rather to donate (part of) its
remaining time slice to the lock holder. That keeps other domains
unaffected, while allowing the subject domain to make better use of its
resources.

> > I thought the solution he had was interesting: when yielding due to a
> > spinlock, rather than going to the back of the queue, just go behind
> > one person. I think an implementation of "yield_to" that might make
> > sense in the credit scheduler is:
> >  * Put the yielding vcpu behind one other vcpu.

Which clearly has the potential of burning more cycles without allowing
the vCPU to actually make progress.

> >  * If the yield-to vcpu is not running, pull it to the front within
> >    its priority. (I.e., if it's UNDER, put it at the front so it runs
> >    next; if it's OVER, make it the first OVER vcpu.)

At the risk of fairness wrt other domains, or even within the domain. As
said above, I think it would be better to temporarily merge the priorities
and run queue locations of the yielding and yielded-to vCPU-s, to have the
yielded-to one get the better of both (with a way to revert to the
original settings under the control of the guest, or enforced when the
borrowed time quantum expires).

The one more difficult case I would see in this model is what needs to
happen when the yielding vCPU has event delivery enabled and receives an
event, making it runnable again: in this situation, the swapping of
priority and/or run queue placement might need to be forcibly reversed
immediately, not so much for fairness reasons as for keeping event
servicing latency reasonable. This includes the fact that in such a case
the vCPU wouldn't be able to do what it wants with the waited-for lock
acquired, but would rather run the event handling code first anyway, and
hence the need for boosting the lock holder went away.

Jan
On Tue, Aug 24, 2010 at 9:48 AM, Jan Beulich <JBeulich@novell.com> wrote:
>>> I thought the solution he had was interesting: when yielding due to a
>>> spinlock, rather than going to the back of the queue, just go behind
>>> one person. I think an implementation of "yield_to" that might make
>>> sense in the credit scheduler is:
>>>  * Put the yielding vcpu behind one other vcpu.
>
> Which clearly has the potential of burning more cycles without allowing
> the vCPU to actually make progress.

I think you may misunderstand; the yielding vcpu goes behind at least one
vcpu on the runqueue, even if the next vcpu is lower priority. If there's
another vcpu on the runqueue, the other vcpu always runs.

I posted some scheduler patches implementing this yield a week or two ago,
and included some numbers. The numbers were with Windows Server 2008,
which has queued spinlocks (the equivalent of ticketed spinlocks). The
throughput remained high even when highly over-committed. So a simple
yield does have a significant effect. In the unlikely event that it is
scheduled again, it will simply yield again when it sees that it's still
waiting for the spinlock.

In fact, undirected yield is one of yield-to's competitors: I don't think
we should accept a "yield-to" patch unless it has significant performance
gains over undirected yield.

> At the risk of fairness wrt other domains, or even within the domain. As
> said above, I think it would be better to temporarily merge the
> priorities and run queue locations of the yielding and yielded-to
> vCPU-s, to have the yielded-to one get the better of both (with a way to
> revert to the original settings under the control of the guest, or
> enforced when the borrowed time quantum expires).

I think doing tricks with priorities is too complicated. Complicated
mechanisms are very difficult to predict and prone to nasty, hard-to-debug
corner cases. I don't think it's worth exploring this kind of solution
until it's clear that a simple solution cannot get reasonable performance.
And I would oppose accepting any priority-inheritance solution into the
tree unless there were repeatable measurements that showed that it had a
significant performance gain over a simpler solution.

 -George
>>> On 24.08.10 at 11:09, George Dunlap <dunlapg@umich.edu> wrote:
> On Tue, Aug 24, 2010 at 9:48 AM, Jan Beulich <JBeulich@novell.com> wrote:
> > Which clearly has the potential of burning more cycles without
> > allowing the vCPU to actually make progress.
>
> I think you may misunderstand; the yielding vcpu goes behind at least
> one vcpu on the runqueue, even if the next vcpu is lower priority. If
> there's another vcpu on the runqueue, the other vcpu always runs.

No, I understood it that way. What I was referring to is (as an example)
the case where two vCPU-s on the same pCPU's run queue both yield: they
will each move after the other in the run queue in close succession, but
neither will really make progress, and neither will really increase the
likelihood of the respective lock holder getting a chance to run.

> I posted some scheduler patches implementing this yield a week or two
> ago, and included some numbers. The numbers were with Windows Server
> 2008, which has queued spinlocks (the equivalent of ticketed spinlocks).
> The throughput remained high even when highly over-committed. So a
> simple yield does have a significant effect. In the unlikely event that
> it is scheduled again, it will simply yield again when it sees that it's
> still waiting for the spinlock.

Immediately, or after a few (hundred) spin cycles?

> In fact, undirected yield is one of yield-to's competitors: I don't
> think we should accept a "yield-to" patch unless it has significant
> performance gains over undirected yield.

This position I agree with.

> I think doing tricks with priorities is too complicated. Complicated
> mechanisms are very difficult to predict and prone to nasty,
> hard-to-debug corner cases. I don't think it's worth exploring this kind
> of solution until it's clear that a simple solution cannot get
> reasonable performance. And I would oppose accepting any
> priority-inheritance solution into the tree unless there were repeatable
> measurements that showed that it had a significant performance gain over
> a simpler solution.

And so I do with this. Apart from suspecting fairness issues with your
yield_to proposal (as I wrote), my point just is - we won't know if a
"complicated" solution outperforms a "simple" one if we don't try it.

Jan
On Tue, Aug 24, 2010 at 2:25 PM, Jan Beulich <JBeulich@novell.com> wrote:
> No, I understood it that way. What I was referring to is (as an example)
> the case where two vCPU-s on the same pCPU's run queue both yield: they
> will each move after the other in the run queue in close succession, but
> neither will really make progress, and neither will really increase the
> likelihood of the respective lock holder getting a chance to run.

Ah, I see. In order for this to be a waste, it needs to be the case that:
 * Two vcpus from different domains grab a spinlock and are then preempted
 * Two vcpus from different domains then fail to grab the spinlock
 * The two vcpus holding the locks are kept from getting cpu by {another
   vcpu, other vcpus} which uses a long time-slice
 * The two waiting for the lock share a cpu with each other and no one else

Of course, in this situation it would be nice if Xen could migrate one of
the other vcpus to the cpu of the two yielding vcpus. That shouldn't be
too hard to implement, at least to see if it has a measurable impact on
aggregate throughput.

> Immediately, or after a few (hundred) spin cycles?

It depends on the implementation. The Citrix guest tools do binary
patching of spinlock routines for w2k3 and XP; I believe they spin for
1000 cycles or so. The Viridian enlightenments I believe would yield
immediately. I think the pause instruction causes a yield immediately as
well.

Yielding immediately when the host is not overloaded is actually probably
not optimal: if the vcpu holding the lock is currently running, it's
likely that by the time the waiting vcpu makes it to the scheduler, the
lock it's waiting for has already been released. (Which is part of the
reason it's a spinlock and not a semaphore.)

> And so I do with this. Apart from suspecting fairness issues with your
> yield_to proposal (as I wrote), my point just is - we won't know if a
> "complicated" solution outperforms a "simple" one if we don't try it.

Are you volunteering? :-)

 -George
George Dunlap wrote:
> Mukesh, did you see the patch by Xiantao Zhang a few days ago, regarding
> what to do on an HVM pause instruction? I thought the solution he had
> was interesting: when yielding due to a spinlock, rather than going to
> the back of the queue, just go behind one person.

What Xiantao (and I, internally) proposed is to implement temporary
coscheduling to solve spin-lock issues, whether for FIFO spin-locks or
ordinary spin-locks, utilizing the PLE exit (it can of course work with PV
spin-locks as well). Here is our thinking (please refer to Xiantao's mail
as well):

There are two typical solutions to improve spin lock efficiency in
virtualization: A) lock holder preemption avoidance (or coscheduling), and
B) helping locks, which donate the spinning CPU cycles to overall system
utilization.

#A solves the spin-lock issue best; however, it requires either hardware
assistance to detect the lock holder, which is impractical, or
coscheduling, which is hard to implement efficiently and sacrifices a lot
of scheduler flexibility. Neither Xen nor KVM implemented that.

#B (the current Xen policy with PLE yielding) may help system performance;
however, it may not help the performance of the spinning guest. In some
cases the guest may become even worse off due to long waits (yields) on
the spin-lock. In other cases it may get back additional CPU cycles (and
performance) from the VMM scheduler, complementing its previous CPU cycle
donation. In general, #B may help system performance when the system is
suitably overcommitted, but it can also hurt single-guest "speed". An
additional issue with #B is that it may hurt FIFO spin locks (ticket
spin-locks in Linux and queued spin-locks in Windows since Windows 2000),
where by OS design only the first-in waiting VCPU is able to take the
lock. Current PLE is not able to tell which VCPU is the next (first-in)
waiting VCPU and which one is the lock holder.

[Proposed optimization] Lock holder preemption avoidance is the right
solution to fully utilize the hardware PLE capability; the current
solution is simply hurting performance, and we need to improve it with
solution #A. Given that current hardware is unable to tell which VCPU is
the lock holder or which one is the next (first-in) waiting VCPU,
coscheduling may be the choice.

However, coscheduling has many side effects as well (somebody said another
company using coscheduling is going to give it up, too). This proposal is
to do temporary coscheduling on top of the existing VMM scheduling. The
details: when one or more VCPUs of a guest are waiting for a spin-lock, we
can temporarily increase the priority of all VCPUs of the same guest so
that they are scheduled in for a short period. The period is kept pretty
small to limit the impact of "coscheduling" on the overall VMM scheduler.
The current Xen patch simply "boosts" the VCPUs, which already shows great
gains, but there may be more tuning of the parameters of this algorithm.

I believe this will be a very good solution to the spin-lock issue with
PLE for now (when the VCPU count is not dramatically large). vConsolidate
(a mix of Linux and Windows guests) shows a 19% consolidation performance
gain, which is almost too good to believe, but it is true :) We are
investing more effort in different workloads, and will post a new patch
soon.

Of course, if a PV guest is running in a PVM container, the PVed spin-lock
is still needed. But I am doubting its necessity if PVM is running on top
of an HVM container :)

Thx, Eddie
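A very rough sketch of the temporary-coscheduling idea follows, purely for
illustration. It is not Xiantao's patch: the types, the boost period and
the hook names are invented here.

    /* Illustrative sketch of "temporary coscheduling" on PLE exit, not
     * Xiantao's patch: everything below is invented for the example. */

    #define BOOST_PERIOD_US 500              /* keep the boost short */

    struct sk_dom_vcpu {
        int  prio;                           /* smaller == runs sooner */
        long boost_expires_us;               /* 0 == no boost pending  */
    };

    struct sk_domain {
        struct sk_dom_vcpu *vcpu;
        int                 nr_vcpus;
    };

    /* Called when one vcpu of the domain takes a PLE (pause-loop) exit:
     * briefly raise every sibling so a preempted lock holder gets to run. */
    static void ple_boost_domain(struct sk_domain *d, long now_us)
    {
        int i;

        for (i = 0; i < d->nr_vcpus; i++) {
            d->vcpu[i].prio = 0;             /* "boost" priority */
            d->vcpu[i].boost_expires_us = now_us + BOOST_PERIOD_US;
        }
    }

    /* Called from the scheduler's accounting path to drop expired boosts
     * back to the vcpu's normal priority. */
    static void ple_boost_expire(struct sk_dom_vcpu *v, long now_us,
                                 int base_prio)
    {
        if (v->boost_expires_us != 0 && now_us >= v->boost_expires_us) {
            v->prio = base_prio;
            v->boost_expires_us = 0;
        }
    }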
> I think an implementation of "yield_to" that might make sense in the
> credit scheduler is:
>  * Put the yielding vcpu behind one other vcpu.
>  * If the yield-to vcpu is not running, pull it to the front within its
>    priority. (I.e., if it's UNDER, put it at the front so it runs next;
>    if it's OVER, make it the first OVER vcpu.)

Yup, I second it.

thanks,
Mukesh
At 17:11 +0100 on 24 Aug (1282669892), George Dunlap wrote:
> It depends on the implementation. The Citrix guest tools do binary
> patching of spinlock routines for w2k3 and XP; I believe they spin for
> 1000 cycles or so. The Viridian enlightenments I believe would yield
> immediately.

IIRC the Viridian interface includes a parameter that the hypervisor
passes to the guest to tell it how long to spin for before yielding.

> I think the pause instruction causes a yield immediately as well.

This is where the PLE hardware assist comes in - it effectively does the
same as the Viridian interface by counting PAUSEs.

FWIW (and I am definitely not a scheduler expert) I'm against anything
that gives a priority boost to a domain's VCPUs based on perceived locking
behaviour, and in favour of keeping things dead simple. Targeted scheduler
"improvements" have bitten us more than once. When George's scheduler
regression tests can give us a more rounded picture of the overall effect
of scheduler tweaks (esp. on fairness), maybe that will change.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)