A while ago I reported an issue that causes Xen PVM guests to hang (apparently
related to spinlocks). Since then I have been able to gather a few more facts,
which I try to provide below. I am still puzzled about the real cause and would
be thankful for any hints or directions.

- Happens with Xen 4.1.2 and Xen 4.2 on the host side.
- PVM guest with 8 VCPUs is running 800 threads doing DB updates.
- When hanging, at least 2 VCPUs sit in xen_spin_lock_slow with the lock free.
  IIRC the weird lock always belongs to one VCPU's runqueue.
- The testcase shows the problem for guest kernel versions v3.2..v3.5 (not
  verified further back). Since v3.6 it cannot be reproduced. A bisect
  ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
  "x86/tlb: enable tlb flush range support for x86"
- The other potential fix for this was to prevent xen_spin_lock_slow()
  from unmasking events (enabling interrupts) if those were enabled
  before the spinlock call.

Maybe allowing or not allowing interrupts in xen_spin_lock_slow just manages
to tip the likelihood of getting nested deeper. The same way, the changed TLB
flush may reduce it by causing fewer IPIs.

From the static information I can extract from a dump it is hard to tell what
exactly is going on. VCPU2 seems to be doing something, but looks to be rather
deep in event handling. Its irq_count is 2, but irq_count is used in a few more
places than just the xen_hypervisor_callback, though that at least seems to be
the entry I can see on the stack for VCPU2. The contents of the per-cpu
variables irq_regs and irq_stack_ptr at least look consistent with a callback
coming in while being on the sched_op hypercall. But then the bt command seems
to be operating on a completely different stack.
VCPU2 seems to be active, not polling (event processing could be initiated from
enable_irqs via the force callback, or before finally exiting the sched_op
hypercall); there seems to be a pending upcall, but upcalls are masked.
The next thing I would try is to instrument the sites incrementing and
decrementing irq_count...

---

Below is some of the info seen from the dom0 debugging keys and from looking
at a dump with crash:

Dom0 Info:

Polling vCPUs: {1,5-7}
VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01

10 [0/1]: s=6 n=1 x=0
40 [0/1]: s=6 n=6 x=0
46 [0/1]: s=6 n=7 x=0

Guest Info:

lock_spinners:
[0] (struct xen_spinlock *) 0x
[1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
[2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
[3] (struct xen_spinlock *) 0x
[4] (struct xen_spinlock *) 0x
[5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
[6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
[7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }

VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700

irq_count[1] = 1
irq_count[2] = 2
It is -1 for the rest.
*(struct pt_regs *) irq_regs[2] = {
    .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
    .sp = 0xffff8803bfc83ce8
}

char *irq_stack_ptr[2] = 0xffff8803bfc83fc0

#> bt -T ffff88033d65c4a0
PID: 2050  TASK: ffff88033d65c4a0  CPU: 2  COMMAND: "postgres"
  [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033d72d620] check_preempt_curr at ffffffff81050254
  [ffff88033d72d640] pull_task at ffffffff8105e33e
  [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
  [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033d72d690] find_get_page at ffffffff811161ee
  [ffff88033d72d6b0] find_lock_page at ffffffff81116286
  [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
  [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
  [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
  [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
  [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
  [ffff88033d72d920] netif_rx at ffffffff8153f400
  [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
  [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
  [ffff88033d72d9d0] do_softirq at ffffffff81016284
  [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
  [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
  [ffff88033d72da50] ip_finish_output at ffffffff815769ab
  [ffff88033d72da90] ip_output at ffffffff81577518
  [ffff88033d72dac0] ip_local_out at ffffffff81576c09
  [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
  [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
  [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
  [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
  [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
  [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
  [ffff88033d72dc10] ttwu_queue at ffffffff81052672
  [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
  --- bt -t would end here
  [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
  [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
  [ffff88033d72dd70] md_make_request at ffffffff814e0606
  [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
  [ffff88033d72de10] mempool_free_slab at ffffffff81118507
  [ffff88033d72de20] mempool_free at ffffffff811185b7
  [ffff88033d72de50] sys_sendto at ffffffff8152974d
  [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
  [ffff88033d72df00] vfs_write at ffffffff811764b0
  [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
  RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
  RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
  RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
  RBP: 0000000000000150  R8:  0000000000000015  R9:  0000000000000000
  R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
  R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
  ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b

#> bt -T ffff88033b812dc0
PID: 2069  TASK: ffff88033b812dc0  CPU: 6  COMMAND: "postgres"
  [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897650] irq_to_desc at ffffffff810d70b7
  [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897670] balance_tasks at ffffffff8105e493
  [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033b897690] find_get_page at ffffffff811161ee
  [ffff88033b8976b0] find_lock_page at ffffffff81116286
  [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
  [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
  [ffff88033b8977b0] unlock_page at ffffffff8111589a
  [ffff88033b8977d0] __do_fault at ffffffff81138ac9
  [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
  [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
  [ffff88033b897900] sched_clock at ffffffff8101bd79
  [ffff88033b897910] blk_rq_init at ffffffff812ec262
  [ffff88033b897930] get_request at ffffffff812f004e
  [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
  [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
  --- bt -t ends here
  [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
  [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897a60] info_for_irq at ffffffff813a636e
  [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
  [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
  [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
  [ffff88033b897b20] xen_spin_lock at ffffffff810121da
  [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
  [ffff88033b897b50] double_rq_lock at ffffffff8105119c
  [ffff88033b897b80] load_balance at ffffffff8105e788
  [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033b897c10] idle_balance at ffffffff8163d57c
  [ffff88033b897c60] __schedule at ffffffff81653ea9
  [ffff88033b897cc0] sleep_on_page at ffffffff81115810
  [ffff88033b897ce0] schedule at ffffffff81653fcf
  [ffff88033b897cf0] io_schedule at ffffffff8165407f
  [ffff88033b897d10] sleep_on_page at ffffffff8111581e
  [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
  [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
  [ffff88033b897da0] wake_bit_function at ffffffff8108a140
  [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
  [ffff88033b897e60] do_writepages at ffffffff81120ee1
  [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
  [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
  [ffff88033b897f00] vfs_write at ffffffff811764b0
  [ffff88033b897f40] do_fsync at ffffffff811a4a36
  [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
  [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
  RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
  RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
  RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
  RBP: 00000000000000c5  R8:  000000000073a550  R9:  0000000000000000
  R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
  R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
  ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
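For reference, the lock_spinners dump above is a per-VCPU pointer into the old
byte-lock based Xen PV spinlock implementation. A rough sketch of the structures
involved, paraphrased from memory of the v3.2-era arch/x86/xen/spinlock.c (field
layout approximate, not a verbatim copy):

    /* Byte lock plus a count of CPUs currently spinning on it. */
    struct xen_spinlock {
            unsigned char lock;             /* 0 -> free, 1 -> taken */
            unsigned short spinners;        /* count of waiting CPUs */
    };

    /*
     * Per-VCPU: the lock this CPU is currently blocking on inside
     * xen_spin_lock_slow(); this is what the "lock_spinners" dump above
     * shows for each VCPU.
     */
    static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

    /* Per-VCPU event channel ("kicker irq") that is polled while waiting. */
    static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;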
On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I reported an issue that causes Xen PVM guests to hang
> (apparently related to spinlocks). [...]
>
> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging, at least 2 VCPUs sit in xen_spin_lock_slow with the lock free.
>   IIRC the weird lock always belongs to one VCPU's runqueue.
> - The testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>   verified further back). Since v3.6 it cannot be reproduced. A bisect
>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>   "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting on this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
>
> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
> manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow).

> The same way, the changed TLB flush may reduce it by causing fewer IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. [...]
> The next thing I would try is to instrument the sites incrementing and
> decrementing irq_count...
>
> [...]
>
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly this means that 2 threads (VCPU2 & 6, I
suppose) are spinning on this lock, but no one is holding it?

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is on nested wakeup.

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock and that might show us something?

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is handled
in xen_spin_lock_slow somehow?

In a similar vein do we not need a barrier in unspinning_lock to ensure
that the remote cpu sees prev getting restored?
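For readers without the source at hand, the two helpers Ian asks about implement
the nesting ("prev") dance. A simplified sketch, paraphrased from the v3.2-era
code (statistics trimmed, details approximate):

    /*
     * Announce that this CPU blocks on xl; remember what it was blocking
     * on before, in case the slow path was entered from an upcall that
     * interrupted an outer slow-path spin.
     */
    static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
    {
            struct xen_spinlock *prev;

            prev = __this_cpu_read(lock_spinners);  /* outer lock, if any */
            __this_cpu_write(lock_spinners, xl);

            wmb();          /* publish lock of interest before the count */

            asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

            return prev;
    }

    /* Undo the above and restore the outer ("prev") lock pointer. */
    static void unspinning_lock(struct xen_spinlock *xl,
                                struct xen_spinlock *prev)
    {
            asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
            wmb();          /* drop the count before restoring prev */
            __this_cpu_write(lock_spinners, prev);
    }

The unlocked __this_cpu_write() to a field that xen_spin_unlock_slow() reads
from another CPU is the race Ian is asking about.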
On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I reported an issue that causes Xen PVM guests to hang
>> (apparently related to spinlocks). [...]
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>
> This is on a host with >= 8 PCPUs, correct?

Maybe >=4 but I cannot remember for sure anymore.

>> [...]
>>   "x86/tlb: enable tlb flush range support for x86"
>
> That seems like a very odd candidate for impacting on this issue.
>
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
>> manages to tip the likelihood of getting nested deeper.
>
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).

I am suspecting something like this. Or maybe generally a nesting issue
revealed by enabling interrupts here.

>> The same way, the changed TLB flush may reduce it by causing fewer IPIs.
>
> Ah, that might explain why that patch is relevant, yes.
>
>> [...]
>>
>> lock_spinners:
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
after receiving the wakeup irq but before returning from the hv call. So the
VCPU that had the lock has returned it and likely the wakeup has been sent.

> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. This would at least re-activate the other waiters. In the case
shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
this would move things ahead.

> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock and that might show us something?

IIRC I had done this but it did not really show much. What I have done in the
meantime was to instrument the IRQ service functions in
arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
that the last events on the vcpu not polling are:

  xen_do_hypervisor_callback+127
  call_softirq+29
  call_softirq+125
  call_softirq+29
  call_softirq+125
  call_softirq+29
  call_softirq+125
  xen_do_hypervisor_callback+28
  call_softirq+29
  xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher offsets
are where irq_count gets decremented before ending the callback. This shows that
at least in this case there was an upcall, a softirq and another upcall being
triggered without finishing the previous one. There was another experiment where
I saw softirq, upcall, upcall. But at least it seems that there is always a
three-level stack.
I believe that softirq, upcall would be ok as softirq processing enables
interrupts, but is it expected to have an upcall interrupted by another one?

> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? Maybe this is handled
> in xen_spin_lock_slow somehow?

Isn't that safe because lock_spinners is percpu?

> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
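A C-level equivalent of the entry_64.S instrumentation Stefan describes would be
a small per-CPU ring buffer recording the instruction pointer at every site that
bumps or drops irq_count. This is a hypothetical sketch, not the actual patch:

    #include <linux/percpu.h>

    #define CB_HIST_LEN 16

    /* Hypothetical debug aid: per-CPU ring of instruction pointers seen at
     * the irq_count increment/decrement sites. */
    struct cb_hist {
            unsigned long ip[CB_HIST_LEN];
            unsigned int idx;
    };
    static DEFINE_PER_CPU(struct cb_hist, cb_hist);

    /* Call next to each irq_count update, passing _THIS_IP_. */
    static inline void cb_hist_record(unsigned long ip)
    {
            struct cb_hist *h = this_cpu_ptr(&cb_hist);

            h->ip[h->idx++ % CB_HIST_LEN] = ip;
    }

The per-CPU buffers can then be dug out of a crash dump, much like the
per-VCPU history Stefan quotes above.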
On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
> On 11.02.2013 18:29, Ian Campbell wrote:
> > On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> >> [...]
> >> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
> >> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> >
> > This is on a host with >= 8 PCPUs, correct?
>
> Maybe >=4 but I cannot remember for sure anymore.

OK, so the important point I was getting at was whether the guest was
overcommitting the host or not, it seems like it is (2xVCPUs for each
PCPU)

> >> [...]
> >>
> >> lock_spinners:
> >> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> >> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
> >
> > If I understand correctly this means that 2 threads (VCPU2 & 6, I
> > suppose) are spinning on this lock, but no one is holding it?
>
> To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
> raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
> after receiving the wakeup irq but before returning from the hv call. So the
> VCPU that had the lock has returned it and likely the wakeup has been sent.

Makes sense, so we are in the window between one holder unlocking and
the next locking, which explains why noone is currently holding it
(which I previously thought odd).

> > An interesting hack^Wexperiment might be to make xen_poll_irq use a
> > timeout and see if that unwedges things -- this would help confirm that
> > the issue is on nested wakeup.
>
> I could do that. This would at least re-activate the other waiters. In the case
> shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
> this would move things ahead.

Yes, it would be interesting e.g. if only either 6 or 2 made progress
and the other remained wedged for all time.

> > I suppose xen_spin_unlock could also be instrumented to indicate who it
> > last kicked and for which lock and that might show us something?
>
> IIRC I had done this but it did not really show much. What I have done in the
> meantime was to instrument the IRQ service functions in
> arch/x86/kernel/entry_64.S to give me a history of callbacks. [...]
> This shows that at least in this case there was an upcall, a softirq and
> another upcall being triggered without finishing the previous one. [...]
> I believe that softirq, upcall would be ok as softirq processing enables
> interrupts, but is it expected to have an upcall interrupted by another one?

I'm not sure. evtchn's don't have a priority mechanism so I expect not
in general but individual interrupt handlers could well reenable
interrupts, I think (it might be one of the flags you pass when calling
request_irq?).

> > Can someone explain why the non-locked update of lock_spinners in
> > spinning_lock() is safe against xen_spin_unlock reading the field from
> > another VCPU? I suspect it isn't, so what happens if the other VCPU
> > kicks the lock which is just about to become prev? Maybe this is handled
> > in xen_spin_lock_slow somehow?
>
> Isn't that safe because lock_spinners is percpu?

xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
is what I think might be unsafe.

Ian.
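The remote access Ian means is the scan in the unlock slow path; roughly (again
a paraphrased sketch of the v3.2-era code, the exact kick mechanism quoted from
memory):

    /*
     * Unlock slow path: look for a CPU that advertised, via its
     * lock_spinners pointer, that it is blocking on this lock, and wake
     * its poller by notifying that CPU's lock_kicker_irq event channel.
     */
    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    /*
                     * Unlocked read of another CPU's lock_spinners, racing
                     * with spinning_lock()/unspinning_lock() over there.
                     */
                    if (per_cpu(lock_spinners, cpu) == xl)
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
            }
    }

This is also the loop Stefan notes further down always starts its search at
vcpu 0.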
On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> [...]
>>>> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
>
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err not so much. I run on an 8-core host. ;)

> [...]
>
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is on nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In the case
>> shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
>> this would move things ahead.
>
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

> [...]
>
> I'm not sure. evtchn's don't have a priority mechanism so I expect not
> in general but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes longer
should be pushed off to a workqueue...

>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? Maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
>
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together always was a bit
brain-hurting. Though somehow it feels like the top-level poll_irq would
return and count things as a spurious wakeup. I think in that case I would
expect the lock_spinners entry of one vcpu to not match the freed lock...
Not a really scientific argument.
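As an aside, the deferral pattern Stefan refers to is the usual one; a minimal,
self-contained sketch with hypothetical names (my_irq_handler and
my_slow_work_fn are not from this thread):

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static void my_slow_work_fn(struct work_struct *work)
    {
            /* Process context: may sleep, runs with interrupts enabled. */
    }
    static DECLARE_WORK(my_slow_work, my_slow_work_fn);

    static irqreturn_t my_irq_handler(int irq, void *dev_id)
    {
            /* Hard-irq context: do the minimum and defer the rest. */
            schedule_work(&my_slow_work);
            return IRQ_HANDLED;
    }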
On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
> On 12.02.2013 15:07, Ian Campbell wrote:
> > [...]
> > OK, so the important point I was getting at was whether the guest was
> > overcommitting the host or not, it seems like it is (2xVCPUs for each
> > PCPU)
>
> Err not so much. I run on an 8-core host. ;)

Um, which is what I asked. What is >= 4 then?

> > [...]
> > I'm not sure. evtchn's don't have a priority mechanism so I expect not
> > in general but individual interrupt handlers could well reenable
> > interrupts, I think (it might be one of the flags you pass when calling
> > request_irq?).
>
> Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
> handler. I thought anything that needs interrupts enabled and/or takes longer
> should be pushed off to a workqueue...

Either that or some sort of interrupt prioritisation mechanism in the
hardware so only higher priority interrupts than the current one will
occur.

> > > > Can someone explain why the non-locked update of lock_spinners in
> > > > spinning_lock() is safe against xen_spin_unlock reading the field from
> > > > another VCPU? [...]
> > >
> > > Isn't that safe because lock_spinners is percpu?
> >
> > xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> > is what I think might be unsafe.
>
> Ah right, that. Hm, the way those two play together always was a bit
> brain-hurting.

Yes...

> Though somehow it feels like the top-level poll_irq would
> return and count things as a spurious wakeup. I think in that case I would
> expect the lock_spinners entry of one vcpu to not match the freed lock...
> Not a really scientific argument.

Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
unspinning_lock might be interesting?

I wonder if this is something where ftrace and the related kernel
infrastructure might be useful?

Ian.
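Spelled out, the check Ian suggests would sit at the top of unspinning_lock(),
for example like this (a debugging sketch only, building on the earlier
paraphrase of that helper; trace_printk() is the ftrace facility he mentions):

    static void unspinning_lock(struct xen_spinlock *xl,
                                struct xen_spinlock *prev)
    {
            /*
             * We should only ever un-announce the lock we announced last;
             * anything else means the prev stacking got corrupted.
             */
            BUG_ON(__this_cpu_read(lock_spinners) != xl);

            /* Cheap breadcrumb in the ftrace ring buffer, which can also be
             * fished out of a crash dump. */
            trace_printk("unspin %p -> prev %p\n", xl, prev);

            asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
            wmb();
            __this_cpu_write(lock_spinners, prev);
    }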
On 12.02.2013 16:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
>> On 12.02.2013 15:07, Ian Campbell wrote:
>>> [...]
>>> OK, so the important point I was getting at was whether the guest was
>>> overcommitting the host or not, it seems like it is (2xVCPUs for each
>>> PCPU)
>>
>> Err not so much. I run on an 8-core host. ;)
>
> Um, which is what I asked. What is >= 4 then?

Uhm, me not being able to discern between P and V at some point. So what I was
thinking was that I believe I had this when running with 4 VCPUs or 8 VCPUs on
the same 8-PCPU host...

> [...]
>
>> Though somehow it feels like the top-level poll_irq would
>> return and count things as a spurious wakeup. I think in that case I would
>> expect the lock_spinners entry of one vcpu to not match the freed lock...
>> Not a really scientific argument.
>
> Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
> unspinning_lock might be interesting?

Hm, besides writing prev into there (which always should not be xl), I think
that since write access only happens in spin_lock_slow this should stack ok.
The two cases that would go wrong are sending a wakeup event when the lock was
just replaced on the other vcpu, or not sending one when it should be.

IIRC I had one attempt print when xen_spin_unlock_slow went through all vcpus
without finding one to signal. That seemed to happen rather often, but then
would resolve itself as xen_spin_lock_slow for the other lock would set the
event pending for the prev lock if there was one (after being successful with
the top-level lock).

I think there is room for improvement by finding a way not to always search
from vcpu 0 in spin_unlock_slow, and maybe in spin_lock_slow to check whether
returning from poll without the current lock being free may be because prev is
free now. When there is more than one spinner, one of them may be able to
proceed...

> I wonder if this is something where ftrace and the related kernel
> infrastructure might be useful?

Maybe. Not sure there are hooks for what we are looking for, and whether the
system is not hanging too badly for it. My usual approach of looking at the
dump would require manually finding the buffer and parsing the records...
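The first improvement Stefan mentions, not always starting the unlock scan at
vcpu 0, could look something like the following purely hypothetical sketch;
kick_cpu() stands in for whatever notification the real code sends, and
contiguous CPU numbering is assumed:

    static void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int start = smp_processor_id();
            int ncpus = num_possible_cpus();
            int i;

            /* Start just after the unlocking CPU and wrap around, so the
             * low-numbered VCPUs are not always the first to be kicked. */
            for (i = 1; i <= ncpus; i++) {
                    int cpu = (start + i) % ncpus;

                    if (cpu_online(cpu) &&
                        per_cpu(lock_spinners, cpu) == xl) {
                            kick_cpu(cpu);  /* hypothetical helper */
                            break;
                    }
            }
    }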
On 11.02.2013 18:29, Ian Campbell wrote:
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.
>

So I went ahead and replaced xen_poll_irq with xen_poll_irq_timeout, and that
did get rid of the hang. Though I think there is a big caveat there. There is
only one other user of poll_irq_timeout in the kernel code, and that one uses
"jiffies + <timeout>*HZ". But when I look at the Xen side in do_poll, it looks
like the timeout is used as an absolute "ns since boot" (of hv/dom0) value.
Not sure how that can ever work. The ns since boot in the guest is clearly
always behind the host (and jiffies is not ns either).

Effectively I likely got rid of any wait time in the hypervisor and went back
to mostly spinning. Which matches the experience that the test run never gets
stuck waiting for a timeout. That maybe proves the stacking is an issue, but
it is also likely a bit too aggressive in not having any wait time at all... :/

I will try to think of some better way. I am not sure the following is
realistic, but maybe this could happen:

xen_spin_lock_slow(a)
...
enables irq and upcalls are pending
upcall processing wants lock b
xen_spin_lock_slow(b)
--- just before replacing lock_spinners ---
xen_spin_unlock_slow(a)
finds other vcpu, triggers IRQ
lock b is top spinner
going into poll_irq
poll_irq returns
lock a gets restored
so maybe no spinners on b
dropping out to xen_spin_lock
unlock of b not finding any spinners
lock b acquired

That way the IRQ for lock a may get lost...
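For illustration only, the timeout experiment described above boils down to
something like the line below (hypothetical timeout value, mirroring the other
in-tree caller); the unit mismatch is exactly why it degenerates into spinning:

    /*
     * Hypothetical replacement of the blocking poll in xen_spin_lock_slow().
     * Xen's do_poll() treats the timeout as an absolute "ns since (host) boot"
     * deadline, so a jiffies-based value is almost always already in the past
     * and the hypercall returns immediately, i.e. we are back to spinning.
     */
    xen_poll_irq_timeout(irq, (u64)jiffies + HZ);   /* was: xen_poll_irq(irq); */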
On 13.02.2013 12:31, Stefan Bader wrote:
> I will try to think of some better way. I am not sure the following is
> realistic, but maybe this could happen:
>
> xen_spin_lock_slow(a)
> ...
> enables irq and upcalls are pending
> upcall processing wants lock b
> xen_spin_lock_slow(b)
> --- just before replacing lock_spinners ---
> xen_spin_unlock_slow(a)
> finds other vcpu, triggers IRQ
> lock b is top spinner
> going into poll_irq
> poll_irq returns
> lock a gets restored
> so maybe no spinners on b
> dropping out to xen_spin_lock
> unlock of b not finding any spinners
> lock b acquired
>
> That way the IRQ for lock a may get lost...

Darn, maybe not, since the pending status is not cleared when leaving
xen_spin_lock_slow. So I assume the interrupted lock does at least one more
cycle out of the slow path and back...
I think I can finally prove why allowing interrupts for the duration of
poll_irq in xen_spin_lock_slow is bad! \o/

So with another run/dump: VCPUs 1, 2, 5 and 6 are spinning on the runq lock of
VCPU 1. The number of spinners is 4 (ok) and the lock is released. According
to the dom0 debug info, VCPUs 2, 5 and 6 are polling; VCPU 1 is not, but seems
to be running (has=T). The event channel for spinlock1 is pending.

Checking the interrupt stack of VCPU 1 in the guest dump shows:

hypercall_page+938
xen_poll_irq+16
xen_spin_lock_slow+151
xen_spin_lock_flags+99
_raw_spin_lock_irqsave+46
update_shares+156
rebalance_domains+72
run_rebalance_domains+72
__do_softirq+168
call_softirq+99
do_softirq+101
irq_exit+142
xen_evtchn_do_upcall+53
xen_do_hypervisor_callback+101

So right before finishing a previous callback, irq_exit triggers softirq
processing (interrupts enabled), and while updating the shares this tries to
grab the runq lock which we see in lock_spinners. Since irq_count is 2, there
is one more interruption executing right now (though irq_regs won't show
this). So I just proceeded upwards in the interrupt stack and found:

try_to_wake_up+57
default_wake_function+18
autoremove_wake_function+22
wake_bit_function+59
__wake_up_common+88
__wake_up+72
__wake_up_bit+49
end_page_writeback+64
put_io_page+36
ext4_end_bio+338
bio_endio+29
req_bio_endio.isra.45+163
blk_update_request+245
blk_update_bidi_request+49
__blk_end_bidi_request+32
__blk_end_request_all+31
blkif_interrupt+460
handle_irq_event_percpu+85
handle_irq_event+78
handle_edge_irq+132
__xen_evtchn_do_upcall+409
xen_evtchn_do_upcall+47
xen_do_hypervisor_callback+101

There was a bit more on the stack, but try_to_wake_up seemed a believable
current path. Even more so after looking into the function:

#ifdef CONFIG_SMP
        /*
         * If the owning (remote) cpu is still in the middle of schedule() with
         * this task as prev, wait until its done referencing the task.
         */
        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif

And prying the task in question out of the stack, it was one currently
accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters for
the runq lock of VCPU 1, which would get picked up as soon as this callback is
done. Unfortunately we "stay awhile... stay forever!".

Ok, as an experiment I defined __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86, and
running that kernel the PVM guest no longer locks up. The problem there is
that this define is gone since v3.7 (f3e9478674). And the same testcase cannot
trigger the issue since v3.6 (611ae8e3f5). The TLB flush change is unlikely to
be directly related (rather, it changes the number of IPIs and thereby the
likelihood of observing this).

So for v3.6+ kernels, the question is whether it can be ruled out that, during
softirq handling, rebalance_domains->update_blocked_averages (formerly
update_shares) is called, which can then get into spin_lock_slow and enable
interrupts. I have not seen it, nor am I aware of reports about it, but it is
likely one of those rare occurrences.

As for a solution, the simplest one would be never to re-enable interrupts
before poll_irq... Anything else feels rather complicated at the moment (like
trying to make the scheduler disable interrupts itself and use the plain
spin_lock variants instead of the spin_lock_irq ones... weird hidden
implications in common code "just" for PVM).

-Stefan

Note: comparing the number of DB records created in a 15min run of the
testcase seems to yield similar results with interrupts enabled (via that arch
define, in v3.2) or disabled. Though I do not have a large sample base, and
that PVM guest is the only one running on the host.
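To make the "simplest" option above concrete, here is a rough sketch of the
poll loop in xen_spin_lock_slow with the conditional interrupt re-enabling
dropped (reconstructed from memory of the v3.2-era code and simplified; not an
actual patch):

    do {
            xen_clear_irq_pending(irq);

            /* the lock may have become free while we were setting up */
            if (xen_spin_trylock(lock))
                    goto out;

            /*
             * Previously: if the caller had interrupts enabled (irq_enable),
             * they were re-enabled here around the poll. That is exactly the
             * window in which the upcall -> softirq -> rebalance_domains ->
             * runq-lock chain above was observed.
             */

            /* block with events still masked until the kicker IRQ fires */
            xen_poll_irq(irq);
    } while (!xen_test_irq_pending(irq));   /* tolerate spurious wakeups */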
>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
> There was a bit more on the stack, but try_to_wake_up seemed a believable
> current path. Even more so after looking into the function:
>
> #ifdef CONFIG_SMP
>         /*
>          * If the owning (remote) cpu is still in the middle of schedule() with
>          * this task as prev, wait until its done referencing the task.
>          */
>         while (p->on_cpu) {
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>                 /*
>                  * In case the architecture enables interrupts in
>                  * context_switch(), we cannot busy wait, since that
>                  * would lead to deadlocks when an interrupt hits and
>                  * tries to wake up @prev. So bail and do a complete
>                  * remote wakeup.
>                  */
>                 if (ttwu_activate_remote(p, wake_flags))
>                         goto stat;
> #else
>                 cpu_relax();
> #endif
>
> And prying the task in question out of the stack, it was one currently
> accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters
> for the runq lock of VCPU 1, which would get picked up as soon as this
> callback is done. Unfortunately we "stay awhile... stay forever!".

When analyzing a similar problem with our old PV ticket lock implementation
(and the interrupt re-enabling there when possible before going into polling
mode), it was precisely this open-coded lock construct that caused problems
for us. Back then I didn't, however, realize that this would even affect the
simplistic byte locks used by the pv-ops Xen code (or else I would have
pointed this out).

Being relatively certain that with our new implementation we don't have any
such problems, I can only recommend against dropping the re-enabling of
interrupts - this needlessly introduces high interrupt latencies in a broader
range of cases. Instead, the interactions ought to be fixed properly. And it's
time for using ticket locks in the Xen code too...

Jan
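For context, a generic illustration (not the Xen or Linux implementation) of
what the ticket locks mentioned above change compared to a byte lock: the next
owner is fixed by ticket order, so the unlock path never has to guess which
waiter to wake:

    #include <stdatomic.h>

    /* Generic ticket lock sketch, for illustration only. */
    struct ticket_lock {
            atomic_uint next;       /* next ticket to hand out */
            atomic_uint owner;      /* ticket currently allowed to enter */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
            unsigned int t = atomic_fetch_add(&l->next, 1);

            while (atomic_load(&l->owner) != t)
                    ;               /* a PV variant would poll/block here */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
            /* hand the lock to the holder of the next ticket, in order */
            atomic_fetch_add(&l->owner, 1);
    }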
On 14.02.2013 12:43, Jan Beulich wrote:
>>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
>> There was a bit more on the stack, but try_to_wake_up seemed a believable
>> current path. Even more so after looking into the function:
>>
>> [...]
>>
>> And prying the task in question out of the stack, it was one currently
>> accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters
>> for the runq lock of VCPU 1, which would get picked up as soon as this
>> callback is done. Unfortunately we "stay awhile... stay forever!".
>
> When analyzing a similar problem with our old PV ticket lock
> implementation (and the interrupt re-enabling there when possible
> before going into polling mode), it was precisely this open-coded
> lock construct that caused problems for us. Back then I didn't,
> however, realize that this would even affect the simplistic byte
> locks used by the pv-ops Xen code (or else I would have pointed
> this out).
>
> Being relatively certain that with our new implementation we don't
> have any such problems, I can only recommend against dropping
> the re-enabling of interrupts - this needlessly introduces high
> interrupt latencies in a broader range of cases. Instead, the
> interactions ought to be fixed properly. And it's time for using
> ticket locks in the Xen code too...

Not sure what *your new* implementation is. ;)

I am more concerned about the currently released/stable kernels which
potentially have this problem. I agree that a proper solution is preferable.
Ticket locks likely have a much lower chance of hitting this, since part of
the current problem is that lower-numbered VCPUs are preferred in unlock_slow.
In the end, though, we would need something that can go upstream (and from
there back into stable), so it should not be too complicated. A proper
solution can then be applied later.

Having more understanding now, I had a different idea. I am not sure this is
foolproof, and it surely causes some more active spinning. But it looks like I
can also keep the system from locking up (v3.2 kernel and the testcase) if I
change xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners instead
of only the first one found.

-Stefan
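The experiment amounts to roughly the following in xen_spin_unlock_slow (a
simplified sketch from memory of the v3.2-era code; stats macros omitted and
details may differ):

    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    if (per_cpu(lock_spinners, cpu) == xl) {
                            /* kick this waiter ... */
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            /*
                             * ... and, unlike before, keep going instead of
                             * breaking out after the first match.
                             */
                    }
            }
    }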
>>> On 14.02.13 at 15:21, Stefan Bader <stefan.bader@canonical.com> wrote:
> Having more understanding now, I had a different idea. I am not sure this is
> foolproof, and it surely causes some more active spinning. But it looks like
> I can also keep the system from locking up (v3.2 kernel and the testcase) if
> I change xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners
> instead of only the first one found.

That, afaict, is an absolute requirement with the open-coded lock construct
that you pointed at, as otherwise the code makes a pseudo-ticket assumption in
believing it knows who is going to be the next owner of the lock.

Jan