I am currently looking at a bug report[1] which is happening when a Xen PVM
guest with multiple VCPUs is running a high IO database load (a test script
is available in the bug report).

In experimenting it seems that this happens (or is getting more likely) when
the number of VCPUs is 8 or higher (though I have not tried 6, only 2 and 4),
having autogroup enabled seems to make it more likely, too (at some point we
thought it would actually prevent it, but we were wrong), and pv spinlocks
enabled. It has happened with older (3.4.3) and newer (4.1.2) versions of Xen
as a host and with 3.2 and 3.5 kernels as guests.

The pv spinlock assumption I will try to get re-verified by asking to
reproduce under a real load with a kernel that just disables that. However,
the dumps I am looking at really do look odd there.

The first dump I looked at had the spinlock of runqueue[0] placed into the
per-cpu lock_spinners variable for cpu#0 and cpu#7 (I am doing my tests with
8 VCPUs). So apparently both cpus were waiting on the slow path for it to
become free. Though actually it was free! Now, here is one issue I have in
understanding the dump: the back traces produced in crash in the normal form
do not show any cpu in the poll_irq HV call. Only when using the form that
uses the last known stack location and displays all text symbols found do I
get close for cpu#7. cpu#0 still does not seem to be anywhere close. This
could be a problem with crash, or with the way PVM works, I am not sure.

Anyway, from what I could take from that situation, it seemed that cpu#1
(that one had soft lockup warnings in the log) was trying to do a wake_up on
the task that just seemed to have done an io schedule on cpu#7 (but the
waitqueue_head spinlock is currently locked). cpu#7 tries to get the
runqueue[0] spinlock to do an idle migration (though the lock currently is
free). So I guessed that maybe cpu#0 was just woken for the free lock but has
not grabbed it yet.

From there I wondered whether something that holds a spinlock for longer than
the quick path timeout (which causes new lockers to go into the slow path)
could cause a stall on a high numbered cpu when the lock is contended (as the
search and signalling is only done for the first waiter, starting from 0).
That led to the patch below. The good thing about it: it does not break
things immediately. The bad thing: it does not help with the problem. :/

The interesting thing when looking at a second dump, taken with a test kernel
using the patch below, was that again two cpus seemed to spin in the slow
path on a lock that was free in the dump. And again at least one did not seem
to be doing anything spinlock related (anymore?). One other detail (but that
could be just coincidence as well) was that with the unmodified kernel I
usually would end up with soft lockup messages, while with the patched kernel
it was a stall detected by rcu_bh (0 and 1 stalled, detected by 3).

Now I am a bit confused and wonder about some basic things:
1. When a cpu goes into the slow lock path and into the poll_irq, shouldn't
   I expect this one to be in hypercall_page in the dump?
2. When does the whole handling for interrupted spinlock wait apply? I assume
   only for a spinlock taken without irqs disabled and then trying to acquire
   another one in an interrupt handler. Is that correct?
3. Not really related, but I just stumbled over it: in xen_spin_trylock there
   is asm("xchgb %b0,%1"...). What does the "b" part of %b0 do? I thought
   xchgb already would say it is a byte value...
But putting aside those questions, the fact that both times there were two
cpus waiting on the same lock which, from the lock count (=0), was free seems
a bit too much of a coincidence. Oh, and the spinners count in the lock was
2, as one would expect.

struct rq {
  lock = {
    raw_lock = {
      {
        head_tail = 512,
        tickets = {
          head = 0 '\000',
          tail = 2 '\002'
        }
      }
    }
  },
  nr_running = 1,
  ...
}

I really don't know how this happens. Especially cpu#0 seems at least past
the wakeup and should have removed itself from the spinners list...

-Stefan

[1] http://bugs.launchpad.net/bugs/1011792

From 635a4e101ccbc9a324c8000f7e264ed4e646592c Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Tue, 16 Oct 2012 17:46:09 +0200
Subject: [PATCH] xen/spinlocks: Make wakeup fairer

Instead of sending the IPI signalling the free lock to the first
online CPU found waiting for it, start the search from the CPU
that had the lock last.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/xen/spinlock.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..8b86efb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -320,17 +320,23 @@ static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
 static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
 {
 	int cpu;
+	int this_cpu = smp_processor_id();
 
 	ADD_STATS(released_slow, 1);
 
-	for_each_online_cpu(cpu) {
-		/* XXX should mix up next cpu selection */
-		if (per_cpu(lock_spinners, cpu) == xl) {
-			ADD_STATS(released_slow_kicked, 1);
-			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-			break;
-		}
-	}
+	cpu = cpumask_next(this_cpu, cpu_online_mask);
+	do {
+		if (cpu < nr_cpu_ids) {
+			if (per_cpu(lock_spinners, cpu) == xl) {
+				ADD_STATS(released_slow_kicked, 1);
+				xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+				break;
+			}
+		} else
+			cpu = -1;
+
+		cpu = cpumask_next(cpu, cpu_online_mask);
+	} while (cpu != this_cpu);
 }
 
 static void xen_spin_unlock(struct arch_spinlock *lock)
-- 
1.7.9.5
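As a side note on reading that crash output: head_tail is simply the two
single-byte fields packed into one 16-bit word, so the snippet below (a
hypothetical, standalone decoding example, not code from the kernel) shows
how 512 splits into the 0 and 2 that crash prints. Under the Xen PV byte-lock
reading used in this report, the first byte being 0 matches "the lock is
free" while the 2 matches the two slow-path spinners mentioned above.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint16_t head_tail = 512;	/* raw value from the crash dump above */

	/* On little-endian x86 the two single-byte fields share storage with
	 * the 16-bit head_tail word, so crash prints the low byte as "head"
	 * and the high byte as "tail". */
	uint8_t low  = head_tail & 0xff;	/* 0x00 -> printed as head = 0 */
	uint8_t high = head_tail >> 8;		/* 0x02 -> printed as tail = 2 */

	printf("head_tail = %u -> low byte %u, high byte %u\n",
	       head_tail, low, high);
	return 0;
}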
Andrew Cooper
2012-Oct-17 13:28 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17/10/12 14:10, Stefan Bader wrote:
> I am currently looking at a bug report[1] which is happening when
> a Xen PVM guest with multiple VCPUs is running a high IO database
> load (a test script is available in the bug report).
> [...]
> I really don't know how this happens. Especially cpu#0 seems at
> least past the wakeup and should have removed itself from the
> spinners list...
>
> -Stefan
>
> [1] http://bugs.launchpad.net/bugs/1011792

We (Citrix) have an outstanding bug, on 2.6.32 classic, which we have never
managed to get to the bottom of (because of an inability to reliably
reproduce), and which might be related.

In our case, certain processes were locking up, and it turned out that the
kernel was issuing SCHEDOP_poll hypercalls (same as the stack trace on your
launchpad ticket) on its own spinlock IPI event channel (understandable, as
it is a spinlock), but with the event channel masked, so it would never wake
up from the poll.

Can you reliably reproduce the issue? And if so, would it be possible to
capture hypervisor debug keys q and e when the badness happens? This would
clearly show whether our bugs are related or not.

Our repro case took about a month to reproduce, and unfortunately turning
lockdep on has appeared to fix the problem.

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Stefan Bader
2012-Oct-17 13:45 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17.10.2012 15:28, Andrew Cooper wrote:
> We (Citrix) have an outstanding bug, on 2.6.32 classic, which we have never
> managed to get to the bottom of (because of an inability to reliably
> reproduce), and which might be related.
>
> In our case, certain processes were locking up, and it turned out that the
> kernel was issuing SCHEDOP_poll hypercalls (same as the stack trace on your
> launchpad ticket) on its own spinlock IPI event channel (understandable, as
> it is a spinlock), but with the event channel masked, so it would never wake
> up from the poll.

Unfortunately either crash looking at it, or something else, is wrong for the
PV guest, as HYPERVISOR_shared_info appears completely blank when I look at
it.

> Can you reliably reproduce the issue? And if so, would it be possible to
> capture hypervisor debug keys q and e when the badness happens? This would
> clearly show whether our bugs are related or not.

Yes, as mentioned, the python script in the bug report does that when meeting
the other conditions. I certainly can do the debug keys on that.

> Our repro case took about a month to reproduce, and unfortunately turning
> lockdep on has appeared to fix the problem.

Yeah, having lockdep enabled was another thing that made it less likely.
Though some reporters had lockups with it, too. Just much less likely.

-Stefan
Ian Campbell
2012-Oct-17 13:55 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On Wed, 2012-10-17 at 14:28 +0100, Andrew Cooper wrote:
> In our case, certain processes were locking up, and it turned out that the
> kernel was issuing SCHEDOP_poll hypercalls (same as the stack trace on your
> launchpad ticket) on its own spinlock IPI event channel (understandable, as
> it is a spinlock), but with the event channel masked, so it would never wake
> up from the poll.

I think (but you might want to check) that SCHEDOP_poll works (or is supposed
to work!) regardless of whether the specific evtchn you ask for is masked or
not.

The Linux PV spinlock implementation takes advantage of this because it never
wants to take a real interrupt from the spinlock poller evtchn.

The IRQ handler for the spinlock evtchn in Linux is:

	static irqreturn_t dummy_handler(int irq, void *dev_id)
	{
		BUG();
		return IRQ_HANDLED;
	}

and right after we register it:

	disable_irq(irq); /* make sure it's never delivered */

There is no enable -- ignoring bugs, of which there have been a couple of
instances, but those trigger the BUG() so are pretty obvious.

Ian.
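For readers following along, here is a rough sketch of how that slow path
ends up blocked in Xen. It is reconstructed from memory of the 3.x sources
rather than quoted from them, so treat the details as approximate: the slow
path ends up issuing a SCHEDOP_poll hypercall naming only the spinlock IPI
port, and the poll completes when an event becomes pending on that port,
independent of the port's mask bit.

/* Rough sketch (not verbatim kernel code) of how the PV spinlock slow path
 * blocks in the hypervisor; the real helper is xen_poll_irq() in
 * drivers/xen/events.c. */
#include <xen/interface/sched.h>	/* struct sched_poll, SCHEDOP_poll */
#include <asm/xen/hypercall.h>		/* HYPERVISOR_sched_op() */

static void poll_spinlock_port(int evtchn)
{
	struct sched_poll poll;

	poll.nr_ports = 1;
	poll.timeout  = 0;			/* 0 == wait forever */
	set_xen_guest_handle(poll.ports, &evtchn);

	/* The vcpu blocks here until an event is pending on 'evtchn'.  The
	 * mask bit only suppresses upcall delivery, not the poll itself,
	 * which is why the spinlock IRQ can stay disabled as Ian describes. */
	if (HYPERVISOR_sched_op(SCHEDOP_poll, &poll) != 0)
		BUG();
}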
Jan Beulich
2012-Oct-17 14:51 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 17.10.12 at 15:28, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> In our case, certain processes were locking up, and it turned out that the
> kernel was issuing SCHEDOP_poll hypercalls (same as the stack trace on your
> launchpad ticket) on its own spinlock IPI event channel (understandable, as
> it is a spinlock), but with the event channel masked, so it would never wake
> up from the poll.

Probably some misunderstanding: the event channel used here will always be
masked, and whether an event channel is masked doesn't matter for the
purposes of polling it.

Jan
Andrew Cooper
2012-Oct-17 15:12 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17/10/12 15:51, Jan Beulich wrote:
>>>> On 17.10.12 at 15:28, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> In our case, certain processes were locking up, and it turned out that the
>> kernel was issuing SCHEDOP_poll hypercalls (same as the stack trace on your
>> launchpad ticket) on its own spinlock IPI event channel (understandable, as
>> it is a spinlock), but with the event channel masked, so it would never
>> wake up from the poll.
> Probably some misunderstanding: the event channel used here will always be
> masked, and whether an event channel is masked doesn't matter for the
> purposes of polling it.
>
> Jan

Sorry - I was not very clear. In our case, we saw every single VCPU stuck in
a SCHEDOP_poll on its own spinlock event channel, which renders the guest
completely deadlocked.

My understanding was that while some VCPUs were blocked in the hypervisor,
the VCPU with the lock should raise an event when the lock was released,
waking the other VCPUs so one could grab the lock. The implication of this
being that there is a bug in the spinlock code.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Stefan Bader
2012-Oct-17 15:21 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17.10.2012 15:55, Ian Campbell wrote:
> I think (but you might want to check) that SCHEDOP_poll works (or is
> supposed to work!) regardless of whether the specific evtchn you ask for
> is masked or not.

I was assuming it to be supposed to work, at least in Xen 4.1.2. Or, at least
if it did not, I would hope to catch VCPUs sitting on the hypercall rather
than doing nothing. Of course I cannot say how reliable the information in
crash is, as this is something new to do after Daniel fixed crash.

> The Linux PV spinlock implementation takes advantage of this because it
> never wants to take a real interrupt from the spinlock poller evtchn.

Right, I probably should have realized that. Though I guess it is still
interesting to see whether the channel is pending.

So when just recreating, I got the q and e info, which is (assuming the guest
domain is enough):

(XEN) Event channel information for domain 1:
(XEN) Polling vCPUs: {1,4,6}
(XEN)     port [p/m]
(XEN)        1 [0/0]: s=3 n=0 d=0 p=68 x=0
(XEN)        2 [1/0]: s=3 n=0 d=0 p=69 x=0
(XEN)        3 [1/0]: s=5 n=0 v=0 x=0
(XEN)        4 [1/1]: s=6 n=0 x=0
(XEN)        5 [1/0]: s=6 n=0 x=0
(XEN)        6 [0/0]: s=6 n=0 x=0
(XEN)        7 [0/0]: s=5 n=0 v=1 x=0
(XEN)        8 [0/0]: s=6 n=0 x=0
(XEN)        9 [1/0]: s=5 n=1 v=0 x=0
(XEN)       10 [0/1]: s=6 n=1 x=0
(XEN)       11 [1/0]: s=6 n=1 x=0
(XEN)       12 [0/0]: s=6 n=1 x=0
(XEN)       13 [0/0]: s=5 n=1 v=1 x=0
(XEN)       14 [0/0]: s=6 n=1 x=0
(XEN)       15 [0/0]: s=5 n=2 v=0 x=0
(XEN)       16 [1/1]: s=6 n=2 x=0
(XEN)       17 [0/0]: s=6 n=2 x=0
(XEN)       18 [0/0]: s=6 n=2 x=0
(XEN)       19 [0/0]: s=5 n=2 v=1 x=0
(XEN)       20 [0/0]: s=6 n=2 x=0
(XEN)       21 [0/0]: s=5 n=3 v=0 x=0
(XEN)       22 [1/1]: s=6 n=3 x=0
(XEN)       23 [0/0]: s=6 n=3 x=0
(XEN)       24 [0/0]: s=6 n=3 x=0
(XEN)       25 [0/0]: s=5 n=3 v=1 x=0
(XEN)       26 [0/0]: s=6 n=3 x=0
(XEN)       27 [1/0]: s=5 n=4 v=0 x=0
(XEN)       28 [0/1]: s=6 n=4 x=0
(XEN)       29 [1/0]: s=6 n=4 x=0
(XEN)       30 [0/0]: s=6 n=4 x=0
(XEN)       31 [0/0]: s=5 n=4 v=1 x=0
(XEN)       32 [0/0]: s=6 n=4 x=0
(XEN)       33 [0/0]: s=5 n=5 v=0 x=0
(XEN)       34 [0/1]: s=6 n=5 x=0
(XEN)       35 [0/0]: s=6 n=5 x=0
(XEN)       36 [0/0]: s=6 n=5 x=0
(XEN)       37 [0/0]: s=5 n=5 v=1 x=0
(XEN)       38 [0/0]: s=6 n=5 x=0
(XEN)       39 [1/0]: s=5 n=6 v=0 x=0
(XEN)       40 [0/1]: s=6 n=6 x=0
(XEN)       41 [1/0]: s=6 n=6 x=0
(XEN)       42 [0/0]: s=6 n=6 x=0
(XEN)       43 [0/0]: s=5 n=6 v=1 x=0
(XEN)       44 [0/0]: s=6 n=6 x=0
(XEN)       45 [0/0]: s=5 n=7 v=0 x=0
(XEN)       46 [1/1]: s=6 n=7 x=0
(XEN)       47 [0/0]: s=6 n=7 x=0
(XEN)       48 [0/0]: s=6 n=7 x=0
(XEN)       49 [0/0]: s=5 n=7 v=1 x=0
(XEN)       50 [0/0]: s=6 n=7 x=0
(XEN)       51 [0/0]: s=3 n=7 d=0 p=70 x=0
(XEN)       52 [0/0]: s=3 n=0 d=0 p=71 x=0
(XEN)       53 [0/0]: s=3 n=0 d=0 p=72 x=0
(XEN)       54 [0/0]: s=3 n=0 d=0 p=73 x=0
(XEN)       55 [1/0]: s=3 n=0 d=0 p=74 x=0

[maybe someone can tell me what the s,n,d,p and x mean]

(XEN) Rangesets belonging to domain 1:
(XEN)     I/O Ports  { }
(XEN)     Interrupts { }
(XEN)     I/O Memory { }
(XEN) Memory pages belonging to domain 1:
(XEN)     DomPage list too long to display
(XEN)     XenPage 00000000008000ab: caf=c000000000000002, taf=7400000000000002
(XEN)     XenPage 00000000008000aa: caf=c000000000000002, taf=7400000000000002
(XEN)     XenPage 00000000008000a9: caf=c000000000000002, taf=7400000000000002
(XEN)     XenPage 00000000008000a8: caf=c000000000000001, taf=7400000000000001
(XEN)     XenPage 00000000000dfae4: caf=c000000000000002, taf=7400000000000002
(XEN) VCPU information and callbacks for domain 1:
(XEN)     VCPU0: CPU3 [has=T] flags=0 poll=0 upcall_pend = 01, upcall_mask = 01 dirty_cpus={3} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU1: CPU7 [has=F] flags=1 poll=10 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU2: CPU4 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU3: CPU5 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU4: CPU6 [has=F] flags=1 poll=28 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU5: CPU7 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={7} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU6: CPU0 [has=F] flags=1 poll=40 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN)     VCPU7: CPU6 [has=T] flags=0 poll=0 upcall_pend = 00, upcall_mask = 01 dirty_cpus={6} cpu_affinity={0-127}
(XEN)     No periodic timer
(XEN) Notifying guest 0:0 (virq 1, port 5, stat 0/0/0)
(XEN) Notifying guest 0:1 (virq 1, port 11, stat 0/0/0)
(XEN) Notifying guest 0:2 (virq 1, port 17, stat 0/0/0)
(XEN) Notifying guest 0:3 (virq 1, port 23, stat 0/0/0)
(XEN) Notifying guest 0:4 (virq 1, port 29, stat 0/0/0)
(XEN) Notifying guest 0:5 (virq 1, port 35, stat 0/0/0)
(XEN) Notifying guest 0:6 (virq 1, port 41, stat 0/0/0)
(XEN) Notifying guest 0:7 (virq 1, port 47, stat 0/0/0)
(XEN) Notifying guest 1:0 (virq 1, port 7, stat 0/0/-1)
(XEN) Notifying guest 1:1 (virq 1, port 13, stat 0/0/-1)
(XEN) Notifying guest 1:2 (virq 1, port 19, stat 0/0/0)
(XEN) Notifying guest 1:3 (virq 1, port 25, stat 0/0/0)
(XEN) Notifying guest 1:4 (virq 1, port 31, stat 0/0/-1)
(XEN) Notifying guest 1:5 (virq 1, port 37, stat 0/0/0)
(XEN) Notifying guest 1:6 (virq 1, port 43, stat 0/0/-1)
(XEN) Notifying guest 1:7 (virq 1, port 49, stat 0/0/0)

When the guest was unresponsive the console would still show:

[10174.372092] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 1} (detected by 4, t=15002 jiffies)
[10284.448089] INFO: rcu_bh detected stalls on CPUs/tasks: { 0 1 4 6} (detected by 5, t=15004 jiffies)

in a repeating pattern. So I take the above as cpus 1, 4 and 6 are polling.
From the dump and the content of lock_spinners I get:

cpu 0 and 1 -> ffff8803bfc13700 (which is runqueue[0] and is unlocked again)
cpu 4 and 6 -> ffffffff81f15ef0 (which is blkif_io_lock and is locked)

Backtraces would be somewhat inconsistent (as always). Note, I should mention
that I still had a kernel with my patch applied on that guest. That changes
things a bit (actually it takes a bit longer to hang, but again that might be
just a matter of timing). The strange lock state of two spinners on an
unlocked lock remains the same with or without it.

One question about the patch actually: would anybody think that there could
be a case where the unlocking cpu has itself on the spinners list? I did not
think so, but that might be wrong.
Andrew Cooper
2012-Oct-17 15:35 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17/10/12 16:21, Stefan Bader wrote:
> So when just recreating, I got the q and e info, which is (assuming the guest
> domain is enough):
>
> (XEN) Event channel information for domain 1:
> (XEN) Polling vCPUs: {1,4,6}
> (XEN)     port [p/m]
> (XEN)        4 [1/1]: s=6 n=0 x=0
> (XEN)       10 [0/1]: s=6 n=1 x=0
> (XEN)       28 [0/1]: s=6 n=4 x=0
> (XEN)       40 [0/1]: s=6 n=6 x=0
> [...]
> [maybe someone can tell me what the s,n,d,p and x mean]

s = state.  0 = free, 1 = reserved, 2 = unbound, 3 = inter-domain, 4 = pirq,
5 = virq, 6 = ipi
n = target vcpu id to notify
x = boolean indicating whether xen is a consumer of the event channel or not.
d = target domain (when appropriate). In this case, p is the target port.

> (XEN)     VCPU0: CPU3 [has=T] flags=0 poll=0 upcall_pend = 01, upcall_mask = 01 dirty_cpus={3} cpu_affinity={0-127}
> (XEN)     VCPU1: CPU7 [has=F] flags=1 poll=10 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
> (XEN)     VCPU4: CPU6 [has=F] flags=1 poll=28 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
> (XEN)     VCPU6: CPU0 [has=F] flags=1 poll=40 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}

So in this case, vcpu 1 is in a poll, on port 10, which is an IPI event
channel for itself.

Same for vcpu 4, except it is on port 28, and for vcpu 6 on port 40.

In each case, the event channels are masked (no surprise given the
conversation so far on this thread), and have no pending events. Therefore,
I believe we are looking at the same bug.

> When the guest was unresponsive the console would still show:
>
> [10174.372092] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 1} (detected by 4, t=15002 jiffies)
> [10284.448089] INFO: rcu_bh detected stalls on CPUs/tasks: { 0 1 4 6} (detected by 5, t=15004 jiffies)
>
> in a repeating pattern. So I take the above as cpus 1, 4 and 6 are polling.
> From the dump and the content of lock_spinners I get:
>
> cpu 0 and 1 -> ffff8803bfc13700 (which is runqueue[0] and is unlocked again)
> cpu 4 and 6 -> ffffffff81f15ef0 (which is blkif_io_lock and is locked)

I wonder if there is possibly a race condition between notifying that a lock
has been unlocked, and another vcpu trying to poll after deciding that the
lock is locked.

The other option is that there is a bug in working out which event channel to
notify when a lock is unlocked.

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Stefan Bader
2012-Oct-17 16:27 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17.10.2012 17:35, Andrew Cooper wrote:
>> (XEN) Event channel information for domain 1:
>> (XEN) Polling vCPUs: {1,4,6}
>> (XEN)     port [p/m]
>> (XEN)        4 [1/1]: s=6 n=0 x=0
>> (XEN)       10 [0/1]: s=6 n=1 x=0
>> (XEN)       28 [0/1]: s=6 n=4 x=0
>> (XEN)       40 [0/1]: s=6 n=6 x=0
>
> s = state.  0 = free, 1 = reserved, 2 = unbound, 3 = inter-domain, 4 = pirq,
> 5 = virq, 6 = ipi
> n = target vcpu id to notify
> x = boolean indicating whether xen is a consumer of the event channel or not.
> d = target domain (when appropriate). In this case, p is the target port.

Thanks (at least something learned today :)) One thing I noticed here: in the
event channel info above, pending is 0 for channels 10, 28 and 40 (and set
for 4, which is the spinlock ipi for cpu 0). But in the VCPU info below
(another unknown: has=T and F) it says upcall_pend for all of them.
Unfortunately that might just mean that things change...

>> (XEN)     VCPU0: CPU3 [has=T] flags=0 poll=0 upcall_pend = 01, upcall_mask = 01 dirty_cpus={3} cpu_affinity={0-127}
>> (XEN)     VCPU1: CPU7 [has=F] flags=1 poll=10 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
>> (XEN)     VCPU4: CPU6 [has=F] flags=1 poll=28 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
>> (XEN)     VCPU6: CPU0 [has=F] flags=1 poll=40 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-127}
>
> So in this case, vcpu 1 is in a poll, on port 10, which is an IPI event
> channel for itself.
>
> Same for vcpu 4, except it is on port 28, and for vcpu 6 on port 40.
>
> I wonder if there is possibly a race condition between notifying that a lock
> has been unlocked, and another vcpu trying to poll after deciding that the
> lock is locked.

There has to be something somewhere, I just cannot spot it. The unlocking cpu
will do a wmb() before setting the lock to 0, then a mb(), and then check for
spinners. When failing the quick path, a locker will first set the
lock_spinners entry, then do a wmb() and increment the spinners count. After
that it clears the event pending and then checks the lock again before
actually going into the poll.

> The other option is that there is a bug in working out which event channel to
> notify when a lock is unlocked.

I had thought I saw one thing that I tried to fix with my patch. Another
train of thought would have been any other cpu grabbing the lock as soon as
it gets released and so preventing any cpu in the poll from success. But that
would then show the lock as locked...
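To make that ordering easier to follow, here is a hedged sketch of the two
sides of the handshake Stefan describes. It is simplified from memory of the
3.2-era arch/x86/xen/spinlock.c (no nesting or interrupt handling shown), and
the helper names kick_waiting_cpu, register_spinner and unregister_spinner
are placeholders for the corresponding pieces of the real code rather than
actual kernel symbols:

/* Simplified sketch of the unlock/slow-lock handshake described above; not
 * the verbatim kernel code.  xl->lock is the byte lock, xl->spinners the
 * waiter count, lock_spinners the per-cpu "lock I am polling on" pointer. */

static void sketch_spin_unlock(struct xen_spinlock *xl)
{
	wmb();				/* order prior stores before the release */
	xl->lock = 0;			/* release the lock */
	mb();				/* order the release against the read below */
	if (unlikely(xl->spinners))	/* anyone registered in the slow path? */
		kick_waiting_cpu(xl);	/* placeholder: scan lock_spinners and
					 * send the spinlock IPI to one waiter */
}

static void sketch_spin_lock_slow(struct xen_spinlock *xl, int irq)
{
	struct xen_spinlock *prev;

	prev = register_spinner(xl);	/* placeholder: publish xl in
					 * lock_spinners, wmb(), then bump
					 * xl->spinners */

	xen_clear_irq_pending(irq);	/* drop any stale kick from earlier */

	if (!xen_spin_trylock(xl))	/* re-check: the lock may have been
					 * freed since the fast path failed */
		xen_poll_irq(irq);	/* block in SCHEDOP_poll, as sketched
					 * after Ian's message above */

	unregister_spinner(xl, prev);	/* drop xl->spinners, restore prev */
}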
Andrew Cooper
2012-Oct-17 17:46 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 17/10/12 17:27, Stefan Bader wrote:
> Thanks (at least something learned today :)) One thing I noticed here: in the
> event channel info above, pending is 0 for channels 10, 28 and 40 (and set
> for 4, which is the spinlock ipi for cpu 0). But in the VCPU info below
> (another unknown: has=T and F) it says upcall_pend for all of them.
> Unfortunately that might just mean that things change...

I believe the upcall_pending is just a simple bit indicating that an event
channel for that vcpu has an event pending, but it does not identify which
event channel it is. Of course, as the VCPU is currently descheduled in Xen
waiting for a specific event channel to end the poll, other events are likely
to build up, but will not be serviced until the VCPU resumes.

The "has=" value is a boolean from struct vcpu, vcpu->is_running. This is a
bit more subtle, as a vcpu is still considered running if its context is on
the hypervisor pcpu stack, but Xen has invoked the idle loop on that PCPU. I
don't know whether this is relevant to the bug.

> There has to be something somewhere, I just cannot spot it. The unlocking cpu
> will do a wmb() before setting the lock to 0, then a mb(), and then check for
> spinners. When failing the quick path, a locker will first set the
> lock_spinners entry, then do a wmb() and increment the spinners count. After
> that it clears the event pending and then checks the lock again before
> actually going into the poll.
>
> I had thought I saw one thing that I tried to fix with my patch. Another
> train of thought would have been any other cpu grabbing the lock as soon as
> it gets released and so preventing any cpu in the poll from success. But that
> would then show the lock as locked...

How does the unlocking code choose an event channel to invoke? Is there
perhaps a race condition updating that value if another cpu nabs the lock in
between?

(I am just throwing ideas around - I am not very familiar with this code)

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
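To answer that in code form: the pre-patch selection logic, as visible in the
removed ('-') lines of Stefan's patch at the top of the thread, simply scans
the per-cpu lock_spinners pointers from cpu 0 upwards and kicks the first cpu
found waiting on the lock being released - which is also why Stefan suspected
unfairness towards high-numbered cpus.

/* The pre-patch kick selection, reproduced from the '-' lines of the patch
 * quoted at the top of the thread (arch/x86/xen/spinlock.c). */
static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
	int cpu;

	ADD_STATS(released_slow, 1);

	for_each_online_cpu(cpu) {
		/* XXX should mix up next cpu selection */
		if (per_cpu(lock_spinners, cpu) == xl) {
			ADD_STATS(released_slow_kicked, 1);
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
			break;
		}
	}
}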
Jan Beulich
2012-Oct-18 07:00 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> In each case, the event channels are masked (no surprise given the
> conversation so far on this thread), and have no pending events. Therefore,
> I believe we are looking at the same bug.

That seems very unlikely (albeit not impossible) to me, given that the
non-pvops kernel uses ticket locks while the pvops one doesn't.

Jan
Jan Beulich
2012-Oct-18 07:08 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 18.10.12 at 09:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> In each case, the event channels are masked (no surprise given the
>> conversation so far on this thread), and have no pending events. Therefore,
>> I believe we are looking at the same bug.
>
> That seems very unlikely (albeit not impossible) to me, given that the
> non-pvops kernel uses ticket locks while the pvops one doesn't.

And in fact we had a similar problem with our original ticket lock
implementation, exposed by an open coded lock in the scheduler's run queue
management. But that was really ticket lock specific, in that a CPU could
passively become the owner of a lock while polling - that's impossible with
pvops' byte locks afaict.

Jan
Stefan Bader
2012-Oct-18 07:24 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 09:00, Jan Beulich wrote:
>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> In each case, the event channels are masked (no surprise given the
>> conversation so far on this thread), and have no pending events.
>> Therefore, I believe we are looking at the same bug.
>
> That seems very unlikely (albeit not impossible) to me, given that
> the non-pvops kernel uses ticket locks while the pvops one doesn't.

True, if classic 2.6.32 actually meant non-pvops. In my memory that was the version that first introduced them, so I assumed we were talking about the same base. The one I am looking at seems (as far as the information available right now goes, though I am trying to get confirmation from reporters running a non-paravirt-spinlock kernel on a production load) to be related directly to the paravirtualized spinlock implementation for Xen.
Stefan Bader
2012-Oct-18 07:38 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 09:08, Jan Beulich wrote:
>>>> On 18.10.12 at 09:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> In each case, the event channels are masked (no surprise given the
>>> conversation so far on this thread), and have no pending events.
>>> Therefore, I believe we are looking at the same bug.
>>
>> That seems very unlikely (albeit not impossible) to me, given that
>> the non-pvops kernel uses ticket locks while the pvops one doesn't.
>
> And in fact we had a similar problem with our original ticket lock
> implementation, exposed by an open-coded lock in the scheduler's
> run queue management. But that was really ticket lock specific,
> in that a CPU could passively become the owner of
> a lock while polling - that's impossible with pvops' byte locks afaict.

One of the trains of thought I had was whether it could happen that a cpu is in polling and the task gets moved. But I don't think that can happen, as the hypercall is unlikely to be a place where any scheduling happens (preempt is none). And it would be much more common...

One detail which I hope someone can fill in is the whole "interrupted spinlock" thing: saving the last lock pointer stored in the per-cpu lock_spinners and so on. Is that really only for spinlocks taken without interrupts disabled, or do I miss something there?
Ian Campbell
2012-Oct-18 07:48 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On Thu, 2012-10-18 at 08:38 +0100, Stefan Bader wrote:
> On 18.10.2012 09:08, Jan Beulich wrote:
> >>>> On 18.10.12 at 09:00, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>> In each case, the event channels are masked (no surprise given the
> >>> conversation so far on this thread), and have no pending events.
> >>> Therefore, I believe we are looking at the same bug.
> >>
> >> That seems very unlikely (albeit not impossible) to me, given that
> >> the non-pvops kernel uses ticket locks while the pvops one doesn't.
> >
> > And in fact we had a similar problem with our original ticket lock
> > implementation, exposed by an open-coded lock in the scheduler's
> > run queue management. But that was really ticket lock specific,
> > in that a CPU could passively become the owner of
> > a lock while polling - that's impossible with pvops' byte locks afaict.
>
> One of the trains of thought I had was whether it could happen that a cpu is in
> polling and the task gets moved. But I don't think that can happen, as the
> hypercall is unlikely to be a place where any scheduling happens (preempt is none). And
> it would be much more common...
>
> One detail which I hope someone can fill in is the whole "interrupted spinlock"
> thing: saving the last lock pointer stored in the per-cpu lock_spinners and so
> on. Is that really only for spinlocks taken without interrupts disabled, or do I
> miss something there?

spinning_lock() returns the old lock, which the caller is expected to remember and replace via unspinning_lock() -- it effectively implements a stack of locks which are being waited on. xen_spin_lock_slow() (the only caller) appears to do this correctly from a brief inspection.

Is there any chance this is just a simple AB-BA or similar type of deadlock? Do we have data which suggests all vCPUs are waiting on the same lock, or just that they are waiting on some lock? I suppose lockdep (which I think you mentioned before?) would have caught this, unless pv locks somehow confound it?

Ian.
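A sketch of the mechanism Ian describes, based on the descriptions in this thread and the patch context further down (simplified; the real code in arch/x86/xen/spinlock.c carries extra statistics):

#include <linux/percpu.h>

/* Per-cpu pointer to the lock this cpu is currently waiting for in the
 * slow path. Nesting (an interrupt handler entering the slow path while
 * the interrupted context was already in it) is handled by remembering
 * the previous value and restoring it afterwards - a one-deep stack. */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev;

	prev = __this_cpu_read(lock_spinners);	/* lock of an outer slow path, or NULL */
	__this_cpu_write(lock_spinners, xl);	/* advertise the lock we poll on now */

	return prev;
}

static void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
{
	/* drop xl from the "stack" and re-expose the outer lock (if any) */
	__this_cpu_write(lock_spinners, prev);
}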
Stefan Bader
2012-Oct-18 10:20 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 09:48, Ian Campbell wrote:
> On Thu, 2012-10-18 at 08:38 +0100, Stefan Bader wrote:
>> On 18.10.2012 09:08, Jan Beulich wrote:
>>>>>> On 18.10.12 at 09:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>>>> On 17.10.12 at 17:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>>> In each case, the event channels are masked (no surprise given the
>>>>> conversation so far on this thread), and have no pending events.
>>>>> Therefore, I believe we are looking at the same bug.
>>>>
>>>> That seems very unlikely (albeit not impossible) to me, given that
>>>> the non-pvops kernel uses ticket locks while the pvops one doesn't.
>>>
>>> And in fact we had a similar problem with our original ticket lock
>>> implementation, exposed by an open-coded lock in the scheduler's
>>> run queue management. But that was really ticket lock specific,
>>> in that a CPU could passively become the owner of
>>> a lock while polling - that's impossible with pvops' byte locks afaict.
>>
>> One of the trains of thought I had was whether it could happen that a cpu is in
>> polling and the task gets moved. But I don't think that can happen, as the
>> hypercall is unlikely to be a place where any scheduling happens (preempt is none). And
>> it would be much more common...
>>
>> One detail which I hope someone can fill in is the whole "interrupted spinlock"
>> thing: saving the last lock pointer stored in the per-cpu lock_spinners and so
>> on. Is that really only for spinlocks taken without interrupts disabled, or do I
>> miss something there?
>
> spinning_lock() returns the old lock, which the caller is expected to
> remember and replace via unspinning_lock() -- it effectively implements
> a stack of locks which are being waited on. xen_spin_lock_slow() (the only
> caller) appears to do this correctly from a brief inspection.

Yes, but just *when* can there be a stack of locks (spinlocks)? The poll_irq hypercall seems to be an active process (in the sense of not preempting to another task). How could there be a situation where another lock gets taken on the same cpu?

> Is there any chance this is just a simple AB-BA or similar type of
> deadlock? Do we have data which suggests all vCPUs are waiting on the
> same lock, or just that they are waiting on some lock? I suppose lockdep
> (which I think you mentioned before?) would have caught this, unless pv
> locks somehow confound it?

In the one situation where I went deeper into the tasks that appeared to be on a cpu, one was waiting to signal a task that looked to have just been scheduled out, while the cpu it was running on was doing an idle balance that waited on the lock for cpu#0's runqueue. cpu#0 itself seemed to be waiting slow (the lock pointer was in lock_spinners[0]), but the lock itself was 0. Though there is a chance that this is always just a coincidental state where the lock was just released, and more related to how the Xen stack does a guest dump.

So it would be a matter of finding out who holds the other lock. Unfortunately a kernel with full lock debugging enabled is sufficiently different in timing that I cannot reproduce the issue on a test machine. And from reported crashes in production I have no data.
Jan Beulich
2012-Oct-18 10:47 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 18.10.12 at 12:20, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 18.10.2012 09:48, Ian Campbell wrote:
>> spinning_lock() returns the old lock, which the caller is expected to
>> remember and replace via unspinning_lock() -- it effectively implements
>> a stack of locks which are being waited on. xen_spin_lock_slow() (the only
>> caller) appears to do this correctly from a brief inspection.
>
> Yes, but just *when* can there be a stack of locks (spinlocks)? The poll_irq
> hypercall seems to be an active process (in the sense of not preempting to
> another task). How could there be a situation where another lock gets
> taken on the same cpu?

Obviously when this is an acquire not disabling interrupts, and an interrupt comes in while in the poll hypercall (or about to go there, or just having come back from one).

Jan
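Spelled out, the one legitimate nesting case looks roughly like this (a hypothetical trace with made-up lock names A and B, only to illustrate the depth-one stack):

/*
 * vcpu in task context, interrupts enabled:
 *   spin_lock(&A)                      -> contended, enters slow path
 *     prev = spinning_lock(&A)         -> prev == NULL, lock_spinners = &A
 *     xen_poll_irq()                   -> blocks in the hypervisor
 *
 *       event upcall (interrupt) delivered to the same vcpu:
 *         spin_lock_irqsave(&B)        -> contended, enters slow path again
 *           prev = spinning_lock(&B)   -> prev == &A, lock_spinners = &B
 *           xen_poll_irq()
 *           unspinning_lock(&B, prev)  -> lock_spinners back to &A
 *         spin_unlock_irqrestore(&B)
 *       return from upcall
 *
 *     poll for A resumes; the unlocker of A must still be able to find
 *     this vcpu via lock_spinners, which now points at &A again.
 */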
Stefan Bader
2012-Oct-18 12:43 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 12:47, Jan Beulich wrote:
>>>> On 18.10.12 at 12:20, Stefan Bader <stefan.bader@canonical.com> wrote:
>> On 18.10.2012 09:48, Ian Campbell wrote:
>>> spinning_lock() returns the old lock, which the caller is expected to
>>> remember and replace via unspinning_lock() -- it effectively implements
>>> a stack of locks which are being waited on. xen_spin_lock_slow() (the only
>>> caller) appears to do this correctly from a brief inspection.
>>
>> Yes, but just *when* can there be a stack of locks (spinlocks)? The poll_irq
>> hypercall seems to be an active process (in the sense of not preempting to
>> another task). How could there be a situation where another lock gets
>> taken on the same cpu?
>
> Obviously when this is an acquire not disabling interrupts, and
> an interrupt comes in while in the poll hypercall (or about to go
> there, or just having come back from one).
>
> Jan

Obviously. ;) Ok, so my thinking there was ok and it's one level deep max. At some point, staring at things, I start to question my sanity.

A wild thought would be whether in that case the interrupted spinlock may miss a wakeup forever when the unlocker can only check for the toplevel. Hm, but that should be easy to rule out by just adding an error to spin_unlock_slow when it fails to find anything...
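A minimal sketch of the instrumentation suggested here (hypothetical; the warning is not part of any posted patch): since the unlock slow path is only entered because the spinner count was non-zero, failing to find any cpu advertising this lock in lock_spinners is exactly the suspicious case.

static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
{
	int cpu;
	bool kicked = false;

	ADD_STATS(released_slow, 1);

	for_each_online_cpu(cpu) {
		if (per_cpu(lock_spinners, cpu) == xl) {
			ADD_STATS(released_slow_kicked, 1);
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
			kicked = true;
			break;
		}
	}

	/* We got here because xl->spinners was non-zero, so finding nobody
	 * to kick would mean a waiter we can no longer see - for example one
	 * whose lock_spinners entry is temporarily overwritten by a nested
	 * slow-path acquisition in an interrupt handler. */
	WARN_ONCE(!kicked, "xen: no spinner found for lock %p\n", xl);
}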
Stefan Bader
2012-Oct-18 20:52 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 14:43, Stefan Bader wrote:
>> Obviously when this is an acquire not disabling interrupts, and
>> an interrupt comes in while in the poll hypercall (or about to go
>> there, or just having come back from one).
>>
>> Jan
>>
> Obviously. ;) Ok, so my thinking there was ok and it's one level deep max. At
> some point, staring at things, I start to question my sanity.
> A wild thought would be whether in that case the interrupted spinlock may miss
> a wakeup forever when the unlocker can only check for the toplevel. Hm, but that
> should be easy to rule out by just adding an error to spin_unlock_slow when it
> fails to find anything...

Actually I begin to suspect that it could be possible that I just overlooked the most obvious thing. Provoking question: are we sure we are on the same page about the purpose of the spin_lock_flags variant of the pv lock ops interface?

I begin to suspect that it really is not there to give a chance to re-enable interrupts. Just what it should be used for I am not clear about. Anyway, it seems all other places more or less ignore the flags and map themselves back to a flags-ignorant version of spinlock. Also I believe that the only high-level function that would end up passing any flags would be the spin_lock_irqsave one. And I am pretty sure that this one will expect interrupts to stay disabled.

So I tried the approach below, and that seems to be surviving the previously breaking testcase for much longer than anything I tried before.

-Stefan

From f2ebb6626f3e3a00932bf1f4f75265f826c7fba9 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Thu, 18 Oct 2012 21:40:37 +0200
Subject: [PATCH 1/2] xen/pv-spinlock: Never enable interrupts in
 xen_spin_lock_slow()

I am not sure what exactly the spin_lock_flags variant of the pv-spinlocks (or even the arch spinlocks) should be used for. But it should not be used as an invitation to enable irqs. The only high-level variant that seems to end up there is the spin_lock_irqsave one, and that would always be used in a context that expects the interrupts to be disabled.

The generic paravirt-spinlock code just maps the flags variant to the one without flags, so just do the same and get rid of all the stuff that is not needed anymore.

This seems to resolve a weird locking issue seen when having a high i/o database load on a PV Xen guest with multiple (8+ in local experiments) CPUs.

Well, thinking about it a second time it seems like one of those "how did that ever work?" cases.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/xen/spinlock.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 83e866d..3330a1d 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -24,7 +24,6 @@ static struct xen_spinlock_stats
 	u32 taken_slow_nested;
 	u32 taken_slow_pickup;
 	u32 taken_slow_spurious;
-	u32 taken_slow_irqenable;

 	u64 released;
 	u32 released_slow;
@@ -197,7 +196,7 @@ static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock
 	__this_cpu_write(lock_spinners, prev);
 }

-static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable)
+static noinline int xen_spin_lock_slow(struct arch_spinlock *lock)
 {
 	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
 	struct xen_spinlock *prev;
@@ -218,8 +217,6 @@ static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enab
 	ADD_STATS(taken_slow_nested, prev != NULL);

 	do {
-		unsigned long flags;
-
 		/* clear pending */
 		xen_clear_irq_pending(irq);

@@ -239,12 +236,6 @@ static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enab
 			goto out;
 		}

-		flags = arch_local_save_flags();
-		if (irq_enable) {
-			ADD_STATS(taken_slow_irqenable, 1);
-			raw_local_irq_enable();
-		}
-
 		/*
 		 * Block until irq becomes pending.  If we're
 		 * interrupted at this point (after the trylock but
@@ -256,8 +247,6 @@ static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enab
 		 */
 		xen_poll_irq(irq);

-		raw_local_irq_restore(flags);
-
 		ADD_STATS(taken_slow_spurious, !xen_test_irq_pending(irq));
 	} while (!xen_test_irq_pending(irq)); /* check for spurious wakeups */

@@ -270,7 +259,7 @@ out:
 	return ret;
 }

-static inline void __xen_spin_lock(struct arch_spinlock *lock, bool irq_enable)
+static inline void __xen_spin_lock(struct arch_spinlock *lock)
 {
 	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
 	unsigned timeout;
@@ -302,19 +291,19 @@ static inline void __xen_spin_lock(struct arch_spinlock *lock, bool irq_enable)
 		spin_time_accum_spinning(start_spin_fast);

 	} while (unlikely(oldval != 0 &&
-			  (TIMEOUT == ~0 || !xen_spin_lock_slow(lock, irq_enable))));
+			  (TIMEOUT == ~0 || !xen_spin_lock_slow(lock))));

 	spin_time_accum_total(start_spin);
 }

 static void xen_spin_lock(struct arch_spinlock *lock)
 {
-	__xen_spin_lock(lock, false);
+	__xen_spin_lock(lock);
 }

 static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
 {
-	__xen_spin_lock(lock, !raw_irqs_disabled_flags(flags));
+	__xen_spin_lock(lock);
 }

 static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
@@ -424,8 +413,6 @@ static int __init xen_spinlock_debugfs(void)
 			   &spinlock_stats.taken_slow_pickup);
 	debugfs_create_u32("taken_slow_spurious", 0444, d_spin_debug,
 			   &spinlock_stats.taken_slow_spurious);
-	debugfs_create_u32("taken_slow_irqenable", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow_irqenable);
 	debugfs_create_u64("released", 0444, d_spin_debug,
 			   &spinlock_stats.released);
 	debugfs_create_u32("released_slow", 0444, d_spin_debug,
--
1.7.9.5
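For context on "the generic paravirt-spinlock code just maps the flags variant to the one without flags", a sketch of that default mapping and of the only high-level caller that actually passes flags (simplified; exact helper names in the kernel of that era may differ slightly):

/* Generic fallback pv op: the flags-taking variant simply ignores the flags. */
static void default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
{
	arch_spin_lock(lock);
}

/* The only high-level path handing flags down is spin_lock_irqsave(): it saves
 * and disables interrupts first, then calls the flags variant, so the saved
 * flags record whether interrupts were enabled before - which is the value
 * xen_spin_lock_flags() used to decide whether to re-enable them while waiting. */
static inline void example_lock_irqsave(arch_spinlock_t *lock, unsigned long *flags)
{
	local_irq_save(*flags);			/* irqs are off from here on */
	arch_spin_lock_flags(lock, *flags);	/* flags describe the previous state */
}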
Stefan Bader
2012-Oct-19 07:10 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 18.10.2012 22:52, Stefan Bader wrote:
> On 18.10.2012 14:43, Stefan Bader wrote:
> So I tried the approach below, and that seems to be surviving the previously breaking
> testcase for much longer than anything I tried before.
>
> -Stefan
>
> From f2ebb6626f3e3a00932bf1f4f75265f826c7fba9 Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Thu, 18 Oct 2012 21:40:37 +0200
> Subject: [PATCH 1/2] xen/pv-spinlock: Never enable interrupts in
> xen_spin_lock_slow()

The testcase ran the whole night without problems. Now I have re-started it with a cleaner kernel (only the patch above applied and no debugging), and it has again passed the threshold at which it would have broken before.

-Stefan

Oh, and just in case this *is* the solution, it should very definitely be marked with a cc: stable.
Jan Beulich
2012-Oct-19 08:06 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 18.10.12 at 22:52, Stefan Bader <stefan.bader@canonical.com> wrote:
> Actually I begin to suspect that it could be possible that I just overlooked
> the most obvious thing. Provoking question: are we sure we are on the same page
> about the purpose of the spin_lock_flags variant of the pv lock ops
> interface?
>
> I begin to suspect that it really is not there to give a chance to re-enable
> interrupts. Just what it should be used for I am not clear about. Anyway, it seems
> all other places more or less ignore the flags and map themselves back to a
> flags-ignorant version of spinlock.
> Also I believe that the only high-level function that would end up passing
> any flags would be the spin_lock_irqsave one. And I am pretty sure that this
> one will expect interrupts to stay disabled.

No - the only requirement here is that from the point on where the lock is owned interrupts must remain disabled. Re-enabling intermediately is quite okay (and used to be done by the native kernel prior to the conversion to ticket locks, iirc).

> So I tried the approach below, and that seems to be surviving the previously
> breaking testcase for much longer than anything I tried before.

If that indeed fixes your problem, then (minus eventual problems with the scope of the interrupts-enabled window) this rather points at a bug in the users of the spinlock interfaces.

Jan
Stefan Bader
2012-Oct-19 08:33 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 19.10.2012 10:06, Jan Beulich wrote:
>>>> On 18.10.12 at 22:52, Stefan Bader <stefan.bader@canonical.com> wrote:
>> Actually I begin to suspect that it could be possible that I just overlooked
>> the most obvious thing. Provoking question: are we sure we are on the same page
>> about the purpose of the spin_lock_flags variant of the pv lock ops
>> interface?
>>
>> I begin to suspect that it really is not there to give a chance to re-enable
>> interrupts. Just what it should be used for I am not clear about. Anyway, it seems
>> all other places more or less ignore the flags and map themselves back to a
>> flags-ignorant version of spinlock.
>> Also I believe that the only high-level function that would end up passing
>> any flags would be the spin_lock_irqsave one. And I am pretty sure that this
>> one will expect interrupts to stay disabled.
>
> No - the only requirement here is that from the point on where
> the lock is owned interrupts must remain disabled. Re-enabling
> intermediately is quite okay (and used to be done by the
> native kernel prior to the conversion to ticket locks, iirc).

Though it seems rather dangerous then. I don't remember the old code, but imo it always opens up a (even if microscopic) window for unexpected interruptions.

>> So I tried the approach below, and that seems to be surviving the previously
>> breaking testcase for much longer than anything I tried before.
>
> If that indeed fixes your problem, then (minus eventual problems
> with the scope of the interrupts-enabled window) this rather
> points at a bug in the users of the spinlock interfaces.

I would be pragmatic here: none of the other current implementations seem to re-enable interrupts, so this only affects xen pv. And how much really is gained from enabling it, compared to the risk of being affected by something that nobody else will be?

> Jan
Jan Beulich
2012-Oct-19 09:24 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 19.10.12 at 10:33, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 19.10.2012 10:06, Jan Beulich wrote:
>>>>> On 18.10.12 at 22:52, Stefan Bader <stefan.bader@canonical.com> wrote:
>>> Actually I begin to suspect that it could be possible that I just overlooked
>>> the most obvious thing. Provoking question: are we sure we are on the same page
>>> about the purpose of the spin_lock_flags variant of the pv lock ops
>>> interface?
>>>
>>> I begin to suspect that it really is not there to give a chance to re-enable
>>> interrupts. Just what it should be used for I am not clear about. Anyway, it seems
>>> all other places more or less ignore the flags and map themselves back to a
>>> flags-ignorant version of spinlock.
>>> Also I believe that the only high-level function that would end up passing
>>> any flags would be the spin_lock_irqsave one. And I am pretty sure that this
>>> one will expect interrupts to stay disabled.
>>
>> No - the only requirement here is that from the point on where
>> the lock is owned interrupts must remain disabled. Re-enabling
>> intermediately is quite okay (and used to be done by the
>> native kernel prior to the conversion to ticket locks, iirc).
>
> Though it seems rather dangerous then. I don't remember the old code, but imo
> it always opens up a (even if microscopic) window for unexpected interruptions.

There just can't be unexpected interruptions. Whenever interrupts are enabled, it is expected that they can occur.

>>> So I tried the approach below, and that seems to be surviving the previously
>>> breaking testcase for much longer than anything I tried before.
>>
>> If that indeed fixes your problem, then (minus eventual problems
>> with the scope of the interrupts-enabled window) this rather
>> points at a bug in the users of the spinlock interfaces.
>
> I would be pragmatic here: none of the other current implementations seem to
> re-enable interrupts, so this only affects xen pv.

I don't think you really checked - the first arch I looked at (s390, being the most obvious one to look at when it comes to virtualization) quite prominently re-enables interrupts in arch_spin_lock_wait_flags().

> And how much really is gained from enabling it, compared to the risk of being
> affected by something that nobody else will be?

The main difference between the native and virtualized cases is that the period of time you spend waiting for the lock to become available is pretty much unbounded (even more so without ticket locks), and keeping interrupts disabled for such an extended period of time is just asking for other problems.

And note that this isn't the case just for Xen PV - all virtualization scenarios suffer from that.

Jan
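The pattern referred to here looks roughly like the following (a generic sketch of a flags-aware spin-wait loop, not a copy of the s390 or the pre-ticket x86 code): interrupts are opened up only while the lock is still being waited for, and are closed again before each acquisition attempt, so the lock is always owned with interrupts disabled.

/* Sketch: spin-wait that re-enables interrupts while waiting, if the saved
 * flags say they were enabled before spin_lock_irqsave(). */
static void spin_lock_wait_flags(arch_spinlock_t *lock, unsigned long flags)
{
	while (!arch_spin_trylock(lock)) {
		if (!raw_irqs_disabled_flags(flags)) {
			raw_local_irq_enable();		/* window: irqs open while merely waiting */
			while (arch_spin_is_locked(lock))
				cpu_relax();
			raw_local_irq_disable();	/* closed again before retrying */
		} else {
			while (arch_spin_is_locked(lock))
				cpu_relax();
		}
	}
	/* the lock is owned here, and interrupts are disabled again */
}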
Stefan Bader
2012-Oct-19 14:03 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 19.10.2012 11:24, Jan Beulich wrote:
>>>> On 19.10.12 at 10:33, Stefan Bader <stefan.bader@canonical.com> wrote:
>> On 19.10.2012 10:06, Jan Beulich wrote:
>>>>>> On 18.10.12 at 22:52, Stefan Bader <stefan.bader@canonical.com> wrote:
>>>> Actually I begin to suspect that it could be possible that I just overlooked
>>>> the most obvious thing. Provoking question: are we sure we are on the
>>>> same page about the purpose of the spin_lock_flags variant of the pv
>>>> lock ops interface?
>>>>
>>>> I begin to suspect that it really is not there to give a chance to
>>>> re-enable interrupts. Just what it should be used for I am not clear about.
>>>> Anyway, it seems all other places more or less ignore the flags and map
>>>> themselves back to a flags-ignorant version of spinlock. Also I believe that
>>>> the only high-level function that would end up passing any flags
>>>> would be the spin_lock_irqsave one. And I am pretty sure that this one
>>>> will expect interrupts to stay disabled.
>>>
>>> No - the only requirement here is that from the point on where the lock
>>> is owned interrupts must remain disabled. Re-enabling intermediately is
>>> quite okay (and used to be done by the native kernel prior to the
>>> conversion to ticket locks, iirc).
>>
>> Though it seems rather dangerous then. I don't remember the old code, but imo
>> it always opens up a (even if microscopic) window for unexpected
>> interruptions.
>
> There just can't be unexpected interruptions. Whenever interrupts are
> enabled, it is expected that they can occur.

Probably one thing that makes this a bit more complicated is that in the PVM case "interrupts" maps to vcpu->evtchn_upcall_mask.

>>>> So I tried the approach below, and that seems to be surviving the previously
>>>> breaking testcase for much longer than anything I tried before.
>>>
>>> If that indeed fixes your problem, then (minus eventual problems with the
>>> scope of the interrupts-enabled window) this rather points at a bug in
>>> the users of the spinlock interfaces.
>>
>> I would be pragmatic here: none of the other current implementations seem
>> to re-enable interrupts, so this only affects xen pv.
>
> I don't think you really checked - the first arch I looked at (s390, being
> the most obvious one to look at when it comes to virtualization) quite
> prominently re-enables interrupts in arch_spin_lock_wait_flags().

No, I assumed that you saying the native kernel did it prior to the ticket lock conversion meant that this involves a more historic search. And yes, s390 has been doing virtualization quite a bit back into history. Just not paravirtualization. And when I look at arch_spin_lock_wait_flags(), the enabling/disabling is done close by (at least I am not wandering off into some hypercall fog).

>> And how much really is gained from enabling it, compared to the risk of
>> being affected by something that nobody else will be?
>
> The main difference between the native and virtualized cases is that the
> period of time you spend waiting for the lock to become available is pretty
> much unbounded (even more so without ticket locks), and keeping interrupts
> disabled for such an extended period of time is just asking for other
> problems.

So I am not sure I followed all of the right paths here, but I think xen_poll_irq ends up doing a hypercall via syscall. The syscall seems to mask off some bits of the flags (maybe irq), but certainly that will not translate automatically into the upcall mask.
Then, hopefully looking at the right place, in the mini-os part there is some annotation in hypervisor_callback about leaving events masked off until having checked for already being interrupted. Could it be the same mask that is checked here, which the guest has cleared to enable interrupts... Would that be expected?

> And note that this isn't the case just for Xen PV - all virtualization
> scenarios suffer from that.
>
> Jan
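For readers less familiar with this part: in a PV guest the "interrupt flag" the kernel toggles really is the per-vcpu event mask mentioned above, roughly as in this simplified sketch (the real implementations in the Xen pvops irq code have additional barriers and preemption handling):

#include <xen/interface/xen.h>	/* struct vcpu_info */

extern void xen_force_evtchn_callback(void);	/* hypercall to trigger a pending upcall */

/* Simplified sketch: instead of cli/sti the PV guest flips its upcall mask. */
static inline void pv_irq_disable(struct vcpu_info *vcpu)
{
	vcpu->evtchn_upcall_mask = 1;		/* no event upcalls delivered */
	barrier();
}

static inline void pv_irq_enable(struct vcpu_info *vcpu)
{
	barrier();
	vcpu->evtchn_upcall_mask = 0;		/* upcalls allowed again */
	barrier();
	if (vcpu->evtchn_upcall_pending)	/* events arrived while masked? */
		xen_force_evtchn_callback();	/* have them delivered now */
}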
Jan Beulich
2012-Oct-19 14:49 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 19.10.12 at 16:03, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 19.10.2012 11:24, Jan Beulich wrote:
>>>>> On 19.10.12 at 10:33, Stefan Bader <stefan.bader@canonical.com> wrote:
>>> I would be pragmatic here: none of the other current implementations seem
>>> to re-enable interrupts, so this only affects xen pv.
>>
>> I don't think you really checked - the first arch I looked at (s390, being
>> the most obvious one to look at when it comes to virtualization) quite
>> prominently re-enables interrupts in arch_spin_lock_wait_flags().
>
> No, I assumed that you saying the native kernel did it prior to the ticket lock
> conversion,

I indeed did say that. Just go look back at the old code.

Jan

> meant that this involves a more historic search. And yes, s390 has been doing
> virtualization quite a bit back into history. Just not paravirtualization.
> And when I look at arch_spin_lock_wait_flags(), the enabling/disabling is done
> close by (at least I am not wandering off into some hypercall fog).
Stefan Bader
2012-Oct-19 14:57 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 19.10.2012 16:49, Jan Beulich wrote:
>> No, I assumed that you saying the native kernel did it prior to the ticket lock
>> conversion,
>
> I indeed did say that. Just go look back at the old code.

No, sorry, but no. Just the fact that old code did it does not help me here; I do believe you there. Still, it seems there is something wrong with doing it in a pv guest, though I cannot give a satisfying explanation yet.
Jan Beulich
2012-Oct-19 15:08 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 19.10.12 at 16:57, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 19.10.2012 16:49, Jan Beulich wrote:
>>> No, I assumed that you saying the native kernel did it prior to the ticket lock
>>> conversion,
>>
>> I indeed did say that. Just go look back at the old code.
>
> No, sorry, but no. Just the fact that old code did it does not help me here;
> I do believe you there. Still, it seems there is something wrong with doing it
> in a pv guest.

PV guest may be a little broad - did you observe such a problem with a recent kernel of ours?

> Though I cannot give a satisfying explanation yet.

But that's going to be necessary in order to judge whether the brute-force approach you take is really the only alternative (and whether it's fixing the problem rather than just making it less likely).

Jan
Stefan Bader
2012-Oct-19 15:21 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
On 19.10.2012 17:08, Jan Beulich wrote:
>>>> On 19.10.12 at 16:57, Stefan Bader <stefan.bader@canonical.com> wrote:
>> On 19.10.2012 16:49, Jan Beulich wrote:
>>>> No, I assumed that you saying the native kernel did it prior to the ticket lock
>>>> conversion,
>>>
>>> I indeed did say that. Just go look back at the old code.
>>
>> No, sorry, but no. Just the fact that old code did it does not help me here;
>> I do believe you there. Still, it seems there is something wrong with doing it
>> in a pv guest.
>
> PV guest may be a little broad - did you observe such a problem
> with a recent kernel of ours?

3.2 (ok, older) and 3.5 up to now; I certainly can and should retry with at least 3.6. And on the host side it was on Amazon EC2 and locally with a 4.1.2 host.

>> Though I cannot give a satisfying explanation yet.
>
> But that's going to be necessary in order to judge whether the
> brute-force approach you take is really the only alternative (and
> whether it's fixing the problem rather than just making it less
> likely).

I know. Unfortunately that means digging quite deeply into the not-so-well-known. And it does not help either that the backtraces in the dump do not seem completely trustworthy; vcpus that the host says are polling on a certain event channel do not seem to be sitting in the hypercall.

> Jan
Jan Beulich
2012-Oct-19 15:33 UTC
Re: Xen PVM: Strange lockups when running PostgreSQL load
>>> On 19.10.12 at 17:21, Stefan Bader <stefan.bader@canonical.com> wrote:
> On 19.10.2012 17:08, Jan Beulich wrote:
>>>>> On 19.10.12 at 16:57, Stefan Bader <stefan.bader@canonical.com> wrote:
>>> On 19.10.2012 16:49, Jan Beulich wrote:
>>>>> No, I assumed that you saying the native kernel did it prior to the ticket lock
>>>>> conversion,
>>>>
>>>> I indeed did say that. Just go look back at the old code.
>>>
>>> No, sorry, but no. Just the fact that old code did it does not help me here;
>>> I do believe you there. Still, it seems there is something wrong with doing it
>>> in a pv guest.
>>
>> PV guest may be a little broad - did you observe such a problem
>> with a recent kernel of ours?
>
> 3.2 (ok, older) and 3.5 up to now; I certainly can and should retry with at
> least 3.6. And on the host side it was on Amazon EC2 and locally with a 4.1.2
> host.

Anything 3.0 and newer is more or less the same spinlock-wise; did you ever open a bug report for this?

Jan