Jeremy,

as we are considering using your pv-ops spinlock implementation for our kernels, I'd appreciate your opinion on the following thoughts:

1) While the goal of the per-CPU kicker irq appears to be to avoid all CPUs waiting for a particular lock to get kicked simultaneously, I think this doesn't have the desired effect. This is because Xen doesn't track what event channel you poll for (through SCHEDOP_poll), and rather kicks all CPUs polling for any event channel.

2) While on native not re-enabling interrupts in __raw_spin_lock_flags() may be tolerable (but perhaps questionable), not doing so at least on the slow path here seems suspicious.

3) Introducing yet another per-CPU IRQ for this purpose further tightens scalability. Using a single, IRQF_PER_CPU IRQ should be sufficient here, as long as it gets properly multiplexed onto individual event channels (of which we have far more than IRQs). I have a patch queued for the traditional tree that does just that conversion for the reschedule and call-function IPIs, which I had long planned to get submitted (but so far wasn't able to due to lack of testing done on the migration aspects of it), and once successful was planning on trying to do something similar for the timer IRQ.

I am attaching that (2.6.26 based) patch just for reference.

Thanks, Jan
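For context on the SCHEDOP_poll behaviour described in point 1, a minimal sketch of the guest-side blocking path, modelled on the pv-ops xen_poll_irq() helper (the exact code in the tree under discussion may differ slightly):

/* Block the vcpu by polling a single event channel via SCHEDOP_poll.
 * Per point 1 above, Xen currently wakes every polling vcpu when *any*
 * event channel becomes pending, regardless of the port passed in here. */
void xen_poll_irq(int irq)
{
        evtchn_port_t evtchn = evtchn_from_irq(irq);

        if (VALID_EVTCHN(evtchn)) {
                struct sched_poll poll;

                poll.nr_ports = 1;
                poll.timeout = 0;               /* no timeout: block until woken */
                set_xen_guest_handle(poll.ports, &evtchn);

                if (HYPERVISOR_sched_op(SCHEDOP_poll, &poll) != 0)
                        BUG();
        }
}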
On 4/8/08 11:18, "Jan Beulich" <jbeulich@novell.com> wrote:

> 1) While the goal of the per-CPU kicker irq appears to be to avoid all CPUs
> waiting for a particular lock to get kicked simultaneously, I think this
> doesn't have the desired effect. This is because Xen doesn't track what
> event channel you poll for (through SCHEDOP_poll), and rather kicks all CPUs
> polling for any event channel.

Yes, this is true. We could easily do something better for VCPUs polling a single event channel though, but there hasn't been a need up to now. I suppose it depends how often we have multiple VCPUs stuck waiting for spinlocks. I can sort out a Xen-side patch if someone wanted to measure the benefits from more selective wakeup from poll.

 -- Keir
Jan Beulich wrote:
> Jeremy,
>
> as we are considering using your pv-ops spinlock implementation for our kernels,
> I'd appreciate your opinion on the following thoughts:
>
> 1) While the goal of the per-CPU kicker irq appears to be to avoid all CPUs
> waiting for a particular lock to get kicked simultaneously, I think this
> doesn't have the desired effect. This is because Xen doesn't track what
> event channel you poll for (through SCHEDOP_poll), and rather kicks all CPUs
> polling for any event channel.

There's no problem with kicking all cpus waiting for a given lock, but it was intended to avoid kicking cpus waiting for some other lock. I hadn't looked at the poll implementation that closely. I guess using the per-cpu interrupt gives Xen some room to live up to the expectations we have for it ;)

> 2) While on native not re-enabling interrupts in __raw_spin_lock_flags()
> may be tolerable (but perhaps questionable), not doing so at least on
> the slow path here seems suspicious.

I wasn't sure about that. Is it OK to enable interrupts in the middle of a spinlock? Can it be done unconditionally?

> 3) Introducing yet another per-CPU IRQ for this purpose further
> tightens scalability. Using a single, IRQF_PER_CPU IRQ should be
> sufficient here, as long as it gets properly multiplexed onto individual
> event channels (of which we have far more than IRQs). I have a patch
> queued for the traditional tree that does just that conversion for the
> reschedule and call-function IPIs, which I had long planned to get
> submitted (but so far wasn't able to due to lack of testing done on the
> migration aspects of it), and once successful was planning on trying to
> do something similar for the timer IRQ.

There are two lines of work I'm hoping to push to mitigate this:

One is the unification of 32- and 64-bit interrupt handling, so that they both have an underlying notion of a vector, which is what we map event channels to. Since vectors can be mapped to an (irq, cpu) tuple, it would allow multiple per-cpu vectors/event channels to be mapped to a single irq, and do so generically for all event channel types. That would mean we'd end up allocating one set of interrupts for time, function calls, spinlocks, etc., rather than per cpu.

The other is eliminating NR_IRQ, and making irq allocation completely dynamic.

> I am attaching that (2.6.26 based) patch just for reference.

From a quick look, you're thinking along similar lines.

    J
>> 2) While on native not re-enabling interrupts in __raw_spin_lock_flags()
>> may be tolerable (but perhaps questionable), not doing so at least on
>> the slow path here seems suspicious.
>
> I wasn't sure about that. Is it OK to enable interrupts in the middle
> of a spinlock? Can it be done unconditionally?

That used to be done in the pre-ticket lock implementation, but of course conditional upon the original interrupt flag.

Later yesterday I noticed another issue: The code setting lock_spinners isn't interruption safe - you'll need to return the old value from spinning_lock() and restore it in unspinning_lock().

Also I'm considering doing it ticket-based nevertheless, as "mix(ing) up next cpu selection" won't really help fairness in xen_spin_unlock_slow(). Apart from definitely needing the wakeup to happen for just the target CPU (Keir, I'd want the necessary support in Xen done for that to work regardless of performance measurements with the traditional locking, as it's known that with ticket locks performance suffers from wrong-order CPU kicking), one thing we'd be in even more need for here than old-style spin locks had been would be a directed yield (sub-)hypercall. Has that ever been considered to become a schedop?

Jan
On 5/8/08 07:32, "Jan Beulich" <jbeulich@novell.com> wrote:

> one thing we'd be in even more need for here than
> old-style spin locks had been would be a directed yield (sub-)hypercall.
> Has that ever been considered to become a schedop?

It has. Someone will need to implement it and show a convincing performance win with it on a real benchmark.

 -- Keir
Jan Beulich wrote:
>>> 2) While on native not re-enabling interrupts in __raw_spin_lock_flags()
>>> may be tolerable (but perhaps questionable), not doing so at least on
>>> the slow path here seems suspicious.
>>
>> I wasn't sure about that. Is it OK to enable interrupts in the middle
>> of a spinlock? Can it be done unconditionally?
>
> That used to be done in the pre-ticket lock implementation, but of course
> conditional upon the original interrupt flag.

Right, I see. The spin_lock_flags stuff which the current lock implementation just ignores. I'll add a new lock op for it.

> Later yesterday I noticed another issue: The code setting lock_spinners
> isn't interruption safe - you'll need to return the old value from
> spinning_lock() and restore it in unspinning_lock().

Good catch.

> Also I'm considering doing it ticket-based nevertheless, as "mix(ing) up
> next cpu selection" won't really help fairness in xen_spin_unlock_slow().

Why's that? An alternative might be to just wake all cpus waiting for the lock up, and then let them fight it out. It should be unusual that there's a significant number of waiters anyway, since contention is (should be) rare.

The main reason for ticket locks is to break the egregious unfairness that (some) bus protocols implement. That level of fairness shouldn't be necessary here because once the cpus fall to blocking in the hypervisor, it's up to Xen to tie-break.

> Apart from definitely needing the wakeup to happen for just the target
> CPU (Keir, I'd want the necessary support in Xen done for that to work
> regardless of performance measurements with the traditional locking,
> as it's known that with ticket locks performance suffers from wrong-
> order CPU kicking), one thing we'd be in even more need for here than
> old-style spin locks had been would be a directed yield (sub-)hypercall.
> Has that ever been considered to become a schedop?

Considered. But the point of this exercise was to come up with something that would work with an unmodified hypervisor.

    J
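For reference, a sketch of where such a lock op could slot in, assuming the pv_lock_ops layout of the contemporary pv-ops tree (the field list here is recalled, not quoted from this thread):

struct pv_lock_ops {
        int (*spin_is_locked)(struct raw_spinlock *lock);
        int (*spin_is_contended)(struct raw_spinlock *lock);
        void (*spin_lock)(struct raw_spinlock *lock);
        /* new op: like spin_lock, but gets the caller's saved flags so the
           slow path may re-enable interrupts while waiting */
        void (*spin_lock_flags)(struct raw_spinlock *lock, unsigned long flags);
        int (*spin_trylock)(struct raw_spinlock *lock);
        void (*spin_unlock)(struct raw_spinlock *lock);
};

static inline void __raw_spin_lock_flags(struct raw_spinlock *lock,
                                         unsigned long flags)
{
        PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags);
}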
>>> Jeremy Fitzhardinge <jeremy@goop.org> 05.08.08 20:09 >>>
> Jan Beulich wrote:
>> Later yesterday I noticed another issue: The code setting lock_spinners
>> isn't interruption safe - you'll need to return the old value from
>> spinning_lock() and restore it in unspinning_lock().
>
> Good catch.

More on that: You'll really need two per-CPU variables afaics, one for the current non-irq lock being spun upon, and one for the current irqs-disabled one. The latter one might not need saving/restoring as long as you don't re-enable interrupts, but the code might turn out cleaner when doing the save/restore regardless, e.g. for me (doing ticket locking):

int xen_spin_wait(raw_spinlock_t *lock, unsigned int token)
{
        struct spinning *spinning;
        int rc;

        /* if kicker interrupt not initialized yet, just spin */
        if (spinlock_irq < 0)
                return 0;

        /* announce we're spinning */
        spinning = &__get_cpu_var(spinning);
        if (spinning->lock) {
                BUG_ON(!raw_irqs_disabled());
                spinning = &__get_cpu_var(spinning_irq);
                BUG_ON(spinning->lock);
        }
        spinning->ticket = token >> TICKET_SHIFT;
        smp_wmb();
        spinning->lock = lock;

        /* clear pending */
        xen_clear_irq_pending(spinlock_irq);

        /* check again make sure it didn't become free while
           we weren't looking */
        rc = __raw_spin_trylock(lock);
        if (!rc) {
                /* block until irq becomes pending */
                xen_poll_irq(spinlock_irq);
                kstat_this_cpu.irqs[spinlock_irq]++;
        }

        /* announce we're done */
        spinning->lock = NULL;

        return rc;
}

>> Also I'm considering doing it ticket-based nevertheless, as "mix(ing) up
>> next cpu selection" won't really help fairness in xen_spin_unlock_slow().
>
> Why's that? An alternative might be to just wake all cpus waiting for
> the lock up, and then let them fight it out. It should be unusual that
> there's a significant number of waiters anyway, since contention is
> (should be) rare.

On an 8-core system I'm seeing between 20,000 (x86-64) and 35,000 (i686) wakeup interrupts per CPU. I'm not certain this still counts as rare. Though that number may go down a little once the hypervisor doesn't needlessly wake all polling vCPU-s anymore.

> The main reason for ticket locks is to break the egregious unfairness
> that (some) bus protocols implement. That level of fairness shouldn't
> be necessary here because once the cpus fall to blocking in the
> hypervisor, it's up to Xen to tie-break.

Why? The hypervisor doing the tie-break makes it possibly even more unfair, whereas with tickets and a way to kick the next owner (and only it) almost the same level of fairness as on native can be achieved. The only heuristic to determine by measurement is the number of spin loops before going into poll mode (which your original patch's description and implementation for some reason disagree about - the description says 2^16 loops, the implementation uses 2^10). Obviously, the optimal numbers may turn out different for byte and ticket locks.

And doing a wake-them-all approach isn't good here, as was (supposedly, I wasn't there) explained in a talk on the last summit.

Jan
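The per-CPU state that Jan's xen_spin_wait() sketch assumes might look roughly like the following; these declarations are inferred from the snippet rather than taken from a posted header, so treat the exact layout as a guess:

/* Hypothetical declarations backing the xen_spin_wait() sketch above: one
 * slot for the lock being spun on in normal context, one for the lock being
 * spun on with interrupts disabled. */
struct spinning {
        raw_spinlock_t *lock;   /* lock currently being waited for (NULL = none) */
        unsigned int ticket;    /* our ticket for that lock */
};
static DEFINE_PER_CPU(struct spinning, spinning);      /* normal context */
static DEFINE_PER_CPU(struct spinning, spinning_irq);  /* irqs-disabled context */
static int spinlock_irq = -1;                           /* kicker irq, -1 until set up */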
Jan Beulich wrote:
> More on that: You'll really need two per-CPU variables afaics, one for the
> current non-irq lock being spun upon, and one for the current irqs-disabled
> one. The latter one might not need saving/restoring as long as you don't
> re-enable interrupts, but the code might turn out cleaner when doing the
> save/restore regardless, e.g. for me (doing ticket locking):

Not sure I follow. How do you use the second array at the kick end of the process?

I ended up just storing the previous value locally, and then restoring it. It assumes that locks will strictly nest, of course, but I think that's reasonable.

--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -135,25 +135,39 @@
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
 static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
 
-static inline void spinning_lock(struct xen_spinlock *xl)
+/*
+ * Mark a cpu as interested in a lock.  Returns the CPU's previous
+ * lock of interest, in case we got preempted by an interrupt.
+ */
+static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
 {
-	__get_cpu_var(lock_spinners) = xl;
-	wmb();			/* set lock of interest before count */
+	struct xen_spinlock *prev;
+
+	prev = xchg(&__get_cpu_var(lock_spinners), xl);
+	/* xchg is a barrier */
+
 	asm(LOCK_PREFIX " incw %0"
 	    : "+m" (xl->spinners) : : "memory");
+
+	return prev;
 }
 
-static inline void unspinning_lock(struct xen_spinlock *xl)
+/*
+ * Mark a cpu as no longer interested in a lock.  Restores previous
+ * lock of interest (NULL for none).
+ */
+static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
 {
 	asm(LOCK_PREFIX " decw %0"
 	    : "+m" (xl->spinners) : : "memory");
-	wmb();			/* decrement count before clearing lock */
-	__get_cpu_var(lock_spinners) = NULL;
+	wmb();			/* decrement count before restoring lock */
+	__get_cpu_var(lock_spinners) = prev;
 }
 
 static noinline int xen_spin_lock_slow(struct raw_spinlock *lock, bool irq_enable)
 {
 	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+	struct xen_spinlock *prev;
 	int irq = __get_cpu_var(lock_kicker_irq);
 	int ret;
 	unsigned long flags;
@@ -163,7 +177,8 @@
 		return 0;
 
 	/* announce we're spinning */
-	spinning_lock(xl);
+	prev = spinning_lock(xl);
+
 	flags = __raw_local_save_flags();
 	if (irq_enable) {
 		ADD_STATS(taken_slow_irqenable, 1);
@@ -189,7 +204,7 @@
 
 out:
 	raw_local_irq_restore(flags);
-	unspinning_lock(xl);
+	unspinning_lock(xl, prev);
 
 	return ret;
 }

> int xen_spin_wait(raw_spinlock_t *lock, unsigned int token)
> {
> 	struct spinning *spinning;
> 	int rc;
>
> 	/* if kicker interrupt not initialized yet, just spin */
> 	if (spinlock_irq < 0)
> 		return 0;
>
> 	/* announce we're spinning */
> 	spinning = &__get_cpu_var(spinning);
> 	if (spinning->lock) {
> 		BUG_ON(!raw_irqs_disabled());
> 		spinning = &__get_cpu_var(spinning_irq);
> 		BUG_ON(spinning->lock);
> 	}
> 	spinning->ticket = token >> TICKET_SHIFT;
> 	smp_wmb();
> 	spinning->lock = lock;
>
> 	/* clear pending */
> 	xen_clear_irq_pending(spinlock_irq);
>
> 	/* check again make sure it didn't become free while
> 	   we weren't looking */
> 	rc = __raw_spin_trylock(lock);
> 	if (!rc) {
> 		/* block until irq becomes pending */
> 		xen_poll_irq(spinlock_irq);
> 		kstat_this_cpu.irqs[spinlock_irq]++;
> 	}
>
> 	/* announce we're done */
> 	spinning->lock = NULL;
>
> 	return rc;
> }
>
> On an 8-core system I'm seeing between 20,000 (x86-64) and 35,000
> (i686) wakeup interrupts per CPU. I'm not certain this still counts as rare.
> Though that number may go down a little once the hypervisor doesn't
> needlessly wake all polling vCPU-s anymore.

What workload are you seeing that on? 20-35k interrupts over what time period? In my tests, I only see it fall into the slow path a couple of thousand times per cpu for a kernbench run.

>> The main reason for ticket locks is to break the egregious unfairness
>> that (some) bus protocols implement. That level of fairness shouldn't
>> be necessary here because once the cpus fall to blocking in the
>> hypervisor, it's up to Xen to tie-break.
>
> Why? The hypervisor doing the tie-break makes it possibly even more
> unfair, whereas with tickets and a way to kick the next owner (and only
> it) almost the same level of fairness as on native can be achieved.

I don't think strict fairness is a particularly desirable property; it's certainly not an unambiguous win. The important thing it solves is total starvation, and if the Xen scheduler ends up starving a CPU then that's a scheduler bug we can fix. We can't fix the cache coherence protocols, so we need to use something like a ticket lock.

> The only heuristic to determine by measurement is the number of spin loops
> before going into poll mode (which your original patch's description and
> implementation for some reason disagree about - the description says
> 2^16 loops, the implementation uses 2^10). Obviously, the optimal
> numbers may turn out different for byte and ticket locks.

Yes, there's a bit of confusion about loop iterations vs cycles. The original paper said that after 2^16 *cycles* 90% of locks have been taken, which I map (approximately) to 2^10 loop iterations (but originally I had 2^16 iterations). I don't think it's all that critical; it probably depends too much on workload and precise system and VM configuration to tune really finely, and doesn't end up making all that much difference.

That said, I've implemented a pile of debugfs infrastructure for extracting lots of details about lock performance, so there's some scope for tuning it (including being able to change the timeout on the fly to see how things change).

> And doing a wake-them-all approach isn't good here, as was
> (supposedly, I wasn't there) explained in a talk on the last summit.

Well, yes, wake-all is a total disaster for ticket locks, since it just causes scheduler thrashing. But for byte locks it doesn't matter all that much, since whoever runs next will take the lock and make progress. The others will wake up and spin for a bit before sleeping; not great, but still better than plain spinning.

It might be worth using a smaller timeout for re-lock attempts after waking up, but that could also lead to starvation, as a sleeper will be at a disadvantage against someone who's currently spinning on the lock, and so will tend to lose against new lock takers.

I think we can also mitigate poll's wake-all behaviour by seeing if our particular per-cpu interrupt is pending and drop back into poll immediately if not (i.e., detect a spurious wakeup).

    J
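For context on the 2^10 vs. 2^16 point above, a rough sketch of the byte-lock fast path and its spin-count heuristic; this is a simplified paraphrase under the naming used in the patch above, not code quoted verbatim in this thread:

/* Spin on the byte lock for roughly 2^10 attempts; if it still hasn't come
 * free, fall into xen_spin_lock_slow() and block on the kicker irq. */
#define TIMEOUT (1 << 10)	/* fast-path spin iterations before blocking */

static void xen_spin_lock(struct raw_spinlock *lock)
{
        struct xen_spinlock *xl = (struct xen_spinlock *)lock;
        unsigned timeout;
        u8 oldval;

        do {
                /* fast path: bounded number of raw lock attempts */
                timeout = TIMEOUT;
                do {
                        oldval = 1;	/* xl->lock: 0 = free, 1 = taken */
                        asm("xchgb %1, %0"
                            : "+m" (xl->lock), "+q" (oldval) : : "memory");
                } while (oldval != 0 && --timeout);

                /* timed out: block on the kicker irq, then try again */
        } while (oldval != 0 && !xen_spin_lock_slow(lock, false));
}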
>>> Jeremy Fitzhardinge <jeremy@goop.org> 06.08.08 10:47 >>>
> Jan Beulich wrote:
>> More on that: You'll really need two per-CPU variables afaics, one for the
>> current non-irq lock being spun upon, and one for the current irqs-disabled
>> one. The latter one might not need saving/restoring as long as you don't
>> re-enable interrupts, but the code might turn out cleaner when doing the
>> save/restore regardless, e.g. for me (doing ticket locking):
>
> Not sure I follow. How do you use the second array at the kick end of
> the process?

I just look at both stored values.

> I ended up just storing the previous value locally, and then restoring
> it. It assumes that locks will strictly nest, of course, but I think
> that's reasonable.

Storing the previous value locally is fine. But I don't think you can do with just one 'currently spinning' pointer because of the kicking side requirements - if an irq-save lock interrupted a non-irq one (with the spinning pointer already set) and a remote CPU releases the lock and wants to kick you, it won't be able to if the irq-save lock already replaced the non-irq one. Nevertheless, if you're past the try-lock, you may end up never getting the wakeup.

Since there can only be one non-irq and one irq-save lock a CPU is currently spinning on (the latter as long as you don't re-enable interrupts), two fields, otoh, are sufficient.

Btw., I also think that using an xchg() (and hence a locked transaction) for updating the pointer isn't necessary.

>> On an 8-core system I'm seeing between 20,000 (x86-64) and 35,000
>> (i686) wakeup interrupts per CPU. I'm not certain this still counts as rare.
>> Though that number may go down a little once the hypervisor doesn't
>> needlessly wake all polling vCPU-s anymore.
>
> What workload are you seeing that on? 20-35k interrupts over what time
> period?

Oh, sorry, I meant to say that's for a kernel build (-j12), taking about 400 wall seconds.

> In my tests, I only see it fall into the slow path a couple of thousand
> times per cpu for a kernbench run.

Hmm, that's different then for me. Actually, I see a significant spike at modpost stage 2, when all the .ko-s get linked. The (spinlock) interrupt rate gets up to between 1,000 and 2,000 per CPU and second.

> That said, I've implemented a pile of debugfs infrastructure for
> extracting lots of details about lock performance so there's some scope
> for tuning it (including being able to change the timeout on the fly to
> see how things change).

Yeah, that's gonna be useful to have.

> I think we can also mitigate poll's wake-all behaviour by seeing if our
> particular per-cpu interrupt is pending and drop back into poll
> immediately if not (ie, detect a spurious wakeup).

Oh, of course - I didn't consider this so far.

Jan
Jan Beulich wrote:
> Storing the previous value locally is fine. But I don't think you can do with
> just one 'currently spinning' pointer because of the kicking side
> requirements - if an irq-save lock interrupted a non-irq one (with the
> spinning pointer already set) and a remote CPU releases the lock and
> wants to kick you, it won't be able to if the irq-save lock already replaced
> the non-irq one. Nevertheless, if you're past the try-lock, you may end
> up never getting the wakeup.
>
> Since there can only be one non-irq and one irq-save lock a CPU is
> currently spinning on (the latter as long as you don't re-enable interrupts),
> two fields, otoh, are sufficient.

No, I think it's mostly OK. The sequence is:

   1. mark that we're about to block
   2. clear pending state on irq
   3. test again with trylock
   4. block in poll

The worrisome interval is between 3-4, but it's only a problem if we end up entering the poll with the lock free and the interrupt non-pending. There are a few possibilities for what happens in that interval:

   1. the nested spinlock is fastpath only, and takes some other lock
      -> OK, because we're still marked as interested in this lock, and will be kicked
      -> if the nested spinlock takes the same lock as the outer one, it will end up doing a self-kick
   2. the nested spinlock is slowpath and is kicked
      -> OK, because it will leave the irq pending
   3. the nested spinlock is slowpath, but picks up the lock on retry
      -> bad, because it won't leave the irq pending

The fix is to make sure that in case 3, it checks to see if it's interrupting a blocking lock (by checking if prev != NULL), and leaves the irq pending if it is.

Updated patch below. Compile tested only.

> Btw., I also think that using an xchg() (and hence a locked transaction)
> for updating the pointer isn't necessary.

I was concerned about what would happen if an interrupt got between the fetch and store. But thinking about it, I think you're right.

>>> On an 8-core system I'm seeing between 20,000 (x86-64) and 35,000
>>> (i686) wakeup interrupts per CPU. I'm not certain this still counts as rare.
>>> Though that number may go down a little once the hypervisor doesn't
>>> needlessly wake all polling vCPU-s anymore.
>>
>> What workload are you seeing that on? 20-35k interrupts over what time
>> period?
>
> Oh, sorry, I meant to say that's for a kernel build (-j12), taking about
> 400 wall seconds.
>
>> In my tests, I only see it fall into the slow path a couple of thousand
>> times per cpu for a kernbench run.
>
> Hmm, that's different then for me. Actually, I see a significant spike at
> modpost stage 2, when all the .ko-s get linked. The (spinlock) interrupt
> rate gets up to between 1,000 and 2,000 per CPU and second.

defconfig? allmodconfig? I'll measure again to confirm.

    J

diff -r 5b4b80c08799 arch/x86/xen/spinlock.c
--- a/arch/x86/xen/spinlock.c	Wed Aug 06 01:35:09 2008 -0700
+++ b/arch/x86/xen/spinlock.c	Wed Aug 06 10:51:27 2008 -0700
@@ -22,6 +22,7 @@
 	u64 taken_slow;
 	u64 taken_slow_pickup;
 	u64 taken_slow_irqenable;
+	u64 taken_slow_spurious;
 
 	u64 released;
 	u64 released_slow;
@@ -135,25 +136,41 @@
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
 static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
 
-static inline void spinning_lock(struct xen_spinlock *xl)
+/*
+ * Mark a cpu as interested in a lock.  Returns the CPU's previous
+ * lock of interest, in case we got preempted by an interrupt.
+ */
+static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
 {
+	struct xen_spinlock *prev;
+
+	prev = __get_cpu_var(lock_spinners);
 	__get_cpu_var(lock_spinners) = xl;
+
 	wmb();			/* set lock of interest before count */
+
 	asm(LOCK_PREFIX " incw %0"
 	    : "+m" (xl->spinners) : : "memory");
+
+	return prev;
 }
 
-static inline void unspinning_lock(struct xen_spinlock *xl)
+/*
+ * Mark a cpu as no longer interested in a lock.  Restores previous
+ * lock of interest (NULL for none).
+ */
+static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
 {
 	asm(LOCK_PREFIX " decw %0"
 	    : "+m" (xl->spinners) : : "memory");
-	wmb();			/* decrement count before clearing lock */
-	__get_cpu_var(lock_spinners) = NULL;
+	wmb();			/* decrement count before restoring lock */
+	__get_cpu_var(lock_spinners) = prev;
 }
 
 static noinline int xen_spin_lock_slow(struct raw_spinlock *lock, bool irq_enable)
 {
 	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+	struct xen_spinlock *prev;
 	int irq = __get_cpu_var(lock_kicker_irq);
 	int ret;
 	unsigned long flags;
@@ -163,33 +180,56 @@
 		return 0;
 
 	/* announce we're spinning */
-	spinning_lock(xl);
+	prev = spinning_lock(xl);
+
 	flags = __raw_local_save_flags();
 	if (irq_enable) {
 		ADD_STATS(taken_slow_irqenable, 1);
 		raw_local_irq_enable();
 	}
 
-	/* clear pending */
-	xen_clear_irq_pending(irq);
+	ADD_STATS(taken_slow, 1);
+
+	do {
+		/* clear pending */
+		xen_clear_irq_pending(irq);
 
-	ADD_STATS(taken_slow, 1);
+		/* check again make sure it didn't become free while
+		   we weren't looking */
+		ret = xen_spin_trylock(lock);
+		if (ret) {
+			ADD_STATS(taken_slow_pickup, 1);
+
+			/*
+			 * If we interrupted another spinlock while it
+			 * was blocking, make sure it doesn't block
+			 * without rechecking the lock.
+			 */
+			if (prev != NULL)
+				xen_set_irq_pending(irq);
+			goto out;
+		}
 
-	/* check again make sure it didn't become free while
-	   we weren't looking */
-	ret = xen_spin_trylock(lock);
-	if (ret) {
-		ADD_STATS(taken_slow_pickup, 1);
-		goto out;
-	}
+		/*
+		 * Block until irq becomes pending.  If we're
+		 * interrupted at this point (after the trylock but
		 * before entering the block), then the nested lock
+		 * handler guarantees that the irq will be left
+		 * pending if there's any chance the lock became free;
+		 * xen_poll_irq() returns immediately if the irq is
+		 * pending.
+		 */
+		xen_poll_irq(irq);
+		kstat_this_cpu.irqs[irq]++;
+		ADD_STATS(taken_slow_spurious, !xen_test_irq_pending(irq));
+	} while (!xen_test_irq_pending(irq)); /* check for spurious wakeups */
 
-	/* block until irq becomes pending */
-	xen_poll_irq(irq);
-	kstat_this_cpu.irqs[irq]++;
+	/* Leave the irq pending so that any interrupted blocker will
+	   recheck. */
 
 out:
 	raw_local_irq_restore(flags);
-	unspinning_lock(xl);
+	unspinning_lock(xl, prev);
 
 	return ret;
 }
@@ -333,6 +373,8 @@
 			   &spinlock_stats.taken_slow_pickup);
 	debugfs_create_u64("taken_slow_irqenable", 0444, d_spin_debug,
 			   &spinlock_stats.taken_slow_irqenable);
+	debugfs_create_u64("taken_slow_spurious", 0444, d_spin_debug,
+			   &spinlock_stats.taken_slow_spurious);
 
 	debugfs_create_u64("released", 0444, d_spin_debug,
 			   &spinlock_stats.released);
 	debugfs_create_u64("released_slow", 0444, d_spin_debug,
diff -r 5b4b80c08799 drivers/xen/events.c
--- a/drivers/xen/events.c	Wed Aug 06 01:35:09 2008 -0700
+++ b/drivers/xen/events.c	Wed Aug 06 10:51:27 2008 -0700
@@ -162,6 +162,12 @@
 {
 	struct shared_info *s = HYPERVISOR_shared_info;
 	sync_set_bit(port, &s->evtchn_pending[0]);
+}
+
+static inline int test_evtchn(int port)
+{
+	struct shared_info *s = HYPERVISOR_shared_info;
+	return sync_test_bit(port, &s->evtchn_pending[0]);
 }
@@ -748,6 +754,25 @@
 	clear_evtchn(evtchn);
 }
 
+void xen_set_irq_pending(int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+
+	if (VALID_EVTCHN(evtchn))
+		set_evtchn(evtchn);
+}
+
+bool xen_test_irq_pending(int irq)
+{
+	int evtchn = evtchn_from_irq(irq);
+	bool ret = false;
+
+	if (VALID_EVTCHN(evtchn))
+		ret = test_evtchn(evtchn);
+
+	return ret;
+}
+
 /* Poll waiting for an irq to become pending.  In the usual case, the
    irq will be disabled so it won't deliver an interrupt. */
 void xen_poll_irq(int irq)
diff -r 5b4b80c08799 include/xen/events.h
--- a/include/xen/events.h	Wed Aug 06 01:35:09 2008 -0700
+++ b/include/xen/events.h	Wed Aug 06 10:51:27 2008 -0700
@@ -47,6 +47,8 @@
 
 /* Clear an irq's pending state, in preparation for polling on it */
 void xen_clear_irq_pending(int irq);
+void xen_set_irq_pending(int irq);
+bool xen_test_irq_pending(int irq);
 
 /* Poll waiting for an irq to become pending.  In the usual case, the
    irq will be disabled so it won't deliver an interrupt. */
Jan Beulich wrote:
>> In my tests, I only see it fall into the slow path a couple of thousand
>> times per cpu for a kernbench run.
>
> Hmm, that's different then for me. Actually, I see a significant spike at
> modpost stage 2, when all the .ko-s get linked. The (spinlock) interrupt
> rate gets up to between 1,000 and 2,000 per CPU and second.

BTW, how much contention for CPUs was there?

    J
>>> Jeremy Fitzhardinge <jeremy@goop.org> 06.08.08 22:21 >>>
> Jan Beulich wrote:
>>> In my tests, I only see it fall into the slow path a couple of thousand
>>> times per cpu for a kernbench run.
>>
>> Hmm, that's different then for me. Actually, I see a significant spike at
>> modpost stage 2, when all the .ko-s get linked. The (spinlock) interrupt
>> rate gets up to between 1,000 and 2,000 per CPU and second.
>
> BTW, how much contention for CPUs was there?

In Xen or the kernel? So far these are all Dom0-only measurements, so in Xen there should be no contention.

To answer your other mail's question: It's defconfig builds I'm using.

Jan
Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jeremy@goop.org> 06.08.08 22:21 >>>
>> Jan Beulich wrote:
>>>> In my tests, I only see it fall into the slow path a couple of thousand
>>>> times per cpu for a kernbench run.
>>>
>>> Hmm, that's different then for me. Actually, I see a significant spike at
>>> modpost stage 2, when all the .ko-s get linked. The (spinlock) interrupt
>>> rate gets up to between 1,000 and 2,000 per CPU and second.
>>
>> BTW, how much contention for CPUs was there?
>
> In Xen or the kernel? So far these are all Dom0-only measurements, so
> in Xen there should be no contention.

OK. The spinlock results are most interesting when there's contention for physical cpus between domains.

> To answer your other mail's question: It's defconfig builds I'm using.

[SuSE defconfigs] Presumably that has a lot enabled as modules.

    J
>>> Keir Fraser <keir.fraser@eu.citrix.com> 04.08.08 12:24 >>>
> On 4/8/08 11:18, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> 1) While the goal of the per-CPU kicker irq appears to be to avoid all CPUs
>> waiting for a particular lock to get kicked simultaneously, I think this
>> doesn't have the desired effect. This is because Xen doesn't track what
>> event channel you poll for (through SCHEDOP_poll), and rather kicks all CPUs
>> polling for any event channel.
>
> Yes, this is true. We could easily do something better for VCPUs polling a
> single event channel though, but there hasn't been a need up to now. I
> suppose it depends how often we have multiple VCPUs stuck waiting for
> spinlocks. I can sort out a Xen-side patch if someone wanted to measure the
> benefits from more selective wakeup from poll.

Running kernel builds on 8 vCPU-s competing for 4 pCPU-s shows a 10% improvement in performance with the individual wakeup (patch attached - probably sub-optimal, but I didn't seem to be able to think of a lock-less mechanism to achieve the desired behavior), using ticket locks in the kernel.

Jan
On 11/8/08 13:22, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Yes, this is true. We could easily do something better for VCPUs polling a
>> single event channel though, but there hasn't been a need up to now. I
>> suppose it depends how often we have multiple VCPUs stuck waiting for
>> spinlocks. I can sort out a Xen-side patch if someone wanted to measure the
>> benefits from more selective wakeup from poll.
>
> Running kernel builds on 8 vCPU-s competing for 4 pCPU-s shows a 10%
> improvement in performance with the individual wakeup (patch attached
> - probably sub-optimal, but I didn't seem to be able to think of a lock-less
> mechanism to achieve the desired behavior), using ticket locks in the
> kernel.

Hardly a real-world benchmark. :-)

 -- Keir
On 11/8/08 13:22, "Jan Beulich" <jbeulich@novell.com> wrote:

> Running kernel builds on 8 vCPU-s competing for 4 pCPU-s shows a 10%
> improvement in performance with the individual wakeup (patch attached
> - probably sub-optimal, but I didn't seem to be able to think of a lock-less
> mechanism to achieve the desired behavior), using ticket locks in the
> kernel.

Could you try with the attached lock-free patch?

 -- Keir
Jan Beulich wrote:
> Running kernel builds on 8 vCPU-s competing for 4 pCPU-s shows a 10%
> improvement in performance with the individual wakeup (patch attached
> - probably sub-optimal, but I didn't seem to be able to think of a lock-less
> mechanism to achieve the desired behavior), using ticket locks in the
> kernel.

What exactly are you measuring here? Are you testing your 2.6.2x forward-port with an adaptation of pv spinlocks using ticket locks as the underlying locking mechanism? Does the 10% improvement come compared to straightforward use of sched_poll, or compared to doing spurious wakeup detection?

    J
Keir Fraser wrote:
> Could you try with the attached lock-free patch?

Just to confirm: this will definitely cause SCHEDOP_poll to return if any event is delivered to the polling vcpu, right? It won't restart the poll after event delivery?

Looks good otherwise.

    J
On 11/8/08 19:11, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> Keir Fraser wrote:
>> Could you try with the attached lock-free patch?
>
> Just to confirm: this will definitely cause SCHEDOP_poll to return if any
> event is delivered to the polling vcpu, right? It won't restart the
> poll after event delivery?

That's right. The hypercall returns if any event is delivered to the vcpu, or if any port in the poll set is pending. The fact we don't clear the vcpu from the poll_mask if unblocked for event delivery isn't a correctness issue, but it will cause unnecessary extra work in future invocations of evtchn_set_pending(). Perhaps vcpu_unblock() should clear it.

 -- Keir
Keir Fraser wrote:
> That's right. The hypercall returns if any event is delivered to the vcpu,
> or if any port in the poll set is pending. The fact we don't clear the vcpu
> from the poll_mask if unblocked for event delivery isn't a correctness
> issue, but it will cause unnecessary extra work in future invocations of
> evtchn_set_pending(). Perhaps vcpu_unblock() should clear it.

That seems reasonable. In this use-case, it's quite likely that if the poll is interrupted by event delivery, on return it will find that the spinlock is now free and never re-enter the poll.

    J
On 11/8/08 19:49, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> Keir Fraser wrote:
>> That's right. The hypercall returns if any event is delivered to the vcpu,
>> or if any port in the poll set is pending. The fact we don't clear the vcpu
>> from the poll_mask if unblocked for event delivery isn't a correctness
>> issue, but it will cause unnecessary extra work in future invocations of
>> evtchn_set_pending(). Perhaps vcpu_unblock() should clear it.
>
> That seems reasonable. In this use-case, it's quite likely that if the
> poll is interrupted by event delivery, on return it will find that the
> spinlock is now free and never re-enter the poll.

Attached is a new version of the patch which clears the vcpu from poll_mask when it is unblocked for any reason. Jan: please can you give this one a spin if you get time.

 -- Keir
Keir Fraser wrote:
> On 11/8/08 19:49, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>> Keir Fraser wrote:
>>> That's right. The hypercall returns if any event is delivered to the vcpu,
>>> or if any port in the poll set is pending. The fact we don't clear the vcpu
>>> from the poll_mask if unblocked for event delivery isn't a correctness
>>> issue, but it will cause unnecessary extra work in future invocations of
>>> evtchn_set_pending(). Perhaps vcpu_unblock() should clear it.
>>
>> That seems reasonable. In this use-case, it's quite likely that if the
>> poll is interrupted by event delivery, on return it will find that the
>> spinlock is now free and never re-enter the poll.
>
> Attached is a new version of the patch which clears the vcpu from poll_mask
> when it is unblocked for any reason. Jan: please can you give this one a
> spin if you get time.

Forgot to attach?

    J
On 12/8/08 17:33, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

>>> That seems reasonable. In this use-case, it's quite likely that if the
>>> poll is interrupted by event delivery, on return it will find that the
>>> spinlock is now free and never re-enter the poll.
>>
>> Attached is a new version of the patch which clears the vcpu from poll_mask
>> when it is unblocked for any reason. Jan: please can you give this one a
>> spin if you get time.
>
> Forgot to attach?

Sorry, attached now.

 -- Keir
>>> Jeremy Fitzhardinge <jeremy@goop.org> 11.08.08 20:10 >>>
> Jan Beulich wrote:
>> Running kernel builds on 8 vCPU-s competing for 4 pCPU-s shows a 10%
>> improvement in performance with the individual wakeup (patch attached
>> - probably sub-optimal, but I didn't seem to be able to think of a lock-less
>> mechanism to achieve the desired behavior), using ticket locks in the
>> kernel.
>
> What exactly are you measuring here? Are you testing your 2.6.2x
> forward-port with an adaptation of pv spinlocks using ticket locks as
> the underlying locking mechanism? Does the 10% improvement come

Yes.

> compared to straightforward use of sched_poll, or compared to doing
> spurious wakeup detection?

Compared to spurious wakeup detection done in the kernel, similar to how you had done it in the latest version of your respective patch that I have seen.

Jan
>>> Keir Fraser <keir.fraser@eu.citrix.com> 12.08.08 19:00 >>>
> On 12/8/08 17:33, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>>>> That seems reasonable. In this use-case, it's quite likely that if the
>>>> poll is interrupted by event delivery, on return it will find that the
>>>> spinlock is now free and never re-enter the poll.
>>>
>>> Attached is a new version of the patch which clears the vcpu from poll_mask
>>> when it is unblocked for any reason. Jan: please can you give this one a
>>> spin if you get time.
>>
>> Forgot to attach?
>
> Sorry, attached now.

I can't really explain the results of testing with this version of the patch: while the number of false wakeups got further reduced by somewhat less than 20%, both time spent in the kernel and total execution time went up (8% and 4% respectively) compared to my original (and from all I can tell worse) version of the patch. Nothing else changed as far as I'm aware.

Jan
On 15/8/08 13:15, "Jan Beulich" <jbeulich@novell.com> wrote:

> I can't really explain the results of testing with this version of the patch:
> while the number of false wakeups got further reduced by somewhat
> less than 20%, both time spent in the kernel and total execution time
> went up (8% and 4% respectively) compared to my original (and from
> all I can tell worse) version of the patch. Nothing else changed as far as
> I'm aware.

That is certainly odd. Presumably consistent across a few runs? I can't imagine where extra time would be being spent though...

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.08.08 15:01 >>>
> On 15/8/08 13:15, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> I can't really explain the results of testing with this version of the patch:
>> while the number of false wakeups got further reduced by somewhat
>> less than 20%, both time spent in the kernel and total execution time
>> went up (8% and 4% respectively) compared to my original (and from
>> all I can tell worse) version of the patch. Nothing else changed as far as
>> I'm aware.
>
> That is certainly odd. Presumably consistent across a few runs? I can't
> imagine where extra time would be being spent though...

Yes, I did at least five runs in each environment.

Jan
Keir Fraser wrote:
> On 15/8/08 13:15, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> I can't really explain the results of testing with this version of the patch:
>> while the number of false wakeups got further reduced by somewhat
>> less than 20%, both time spent in the kernel and total execution time
>> went up (8% and 4% respectively) compared to my original (and from
>> all I can tell worse) version of the patch. Nothing else changed as far as
>> I'm aware.
>
> That is certainly odd. Presumably consistent across a few runs? I can't
> imagine where extra time would be being spent though...

2 locked instructions in vcpu_unblock() when using polling? Doesn't seem like 8%-worth of time though.

    J
On 15/8/08 15:06, "Jan Beulich" <jbeulich@novell.com> wrote:

>>> I can't really explain the results of testing with this version of the patch:
>>> while the number of false wakeups got further reduced by somewhat
>>> less than 20%, both time spent in the kernel and total execution time
>>> went up (8% and 4% respectively) compared to my original (and from
>>> all I can tell worse) version of the patch. Nothing else changed as far as
>>> I'm aware.
>>
>> That is certainly odd. Presumably consistent across a few runs? I can't
>> imagine where extra time would be being spent though...
>
> Yes, I did at least five runs in each environment.

It might be worth retrying with the vcpu_unblock() changes removed. It'll still work, but poll_mask may have bits spuriously left set for arbitrary time periods. However, vcpu_unblock() is the only thing I obviously make more expensive than in your patch.

We could also possibly make the vcpu_unblock() check cheaper by testing v->poll_evtchn for non-zero, zeroing it, and clearing the vcpu from poll_mask. Reading a vcpu-local field may be cheaper than getting access to a domain struct cache line.

 -- Keir
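A rough sketch of the cheaper check Keir suggests, assuming a per-vcpu poll_evtchn field that records the single polled port; this is an illustration of the idea only, not the patch that was eventually applied:

/* Hypothetical hypervisor-side wakeup path: the common case touches only
 * vcpu-local state, and only a vcpu that was actually polling pays for the
 * locked update of the domain-wide poll_mask. */
void vcpu_unblock(struct vcpu *v)
{
        if (!test_and_clear_bit(_VPF_blocked, &v->pause_flags))
                return;

        /* Polling period ends when the vcpu is unblocked for any reason. */
        if (unlikely(v->poll_evtchn != 0)) {
                v->poll_evtchn = 0;
                clear_bit(v->vcpu_id, v->domain->poll_mask);
        }

        vcpu_wake(v);
}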