A while ago I reported an issue that causes Xen PVM guests to hang (apparently
related to spinlocks). Since then I have been able to gather a few more facts,
which I try to provide below. I am still puzzled about the real cause and would
be thankful for any hints or directions.

- Happens with Xen 4.1.2 and Xen 4.2 on the host side.
- PVM guest with 8 VCPUs is running 800 threads doing DB updates.
- When hanging, at least 2 VCPUs sit in xen_spin_lock_slow with the lock free.
  IIRC the weird lock always belongs to one VCPU's runqueue.
- The testcase shows the problem for guest kernel versions v3.2..v3.5 (not
  verified further back). Since v3.6 it cannot be reproduced. A bisect
  ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
  "x86/tlb: enable tlb flush range support for x86"
- The other potential fix for this was to prevent xen_spin_lock_slow()
  from unmasking events (enabling interrupts) if those were enabled
  before the spinlock call.

Maybe allowing or not allowing interrupts in xen_spin_lock_slow just manages
to tip the likelihood of getting nested deeper. The same way, the changed TLB
flush may reduce it by causing fewer IPIs.

From the static information I can extract from a dump it is hard to tell what
exactly is going on. VCPU2 seems to be doing something, but looks to be rather
deep in event handling. Its irq_count is 2, but irq_count is used in a few more
places than just the xen_hypervisor_callback, though that at least seems to be
the entry I can see on the stack for VCPU2. The contents of the per-cpu
variables irq_regs and irq_stack_ptr at least look consistent with a callback
coming in while being on the sched_op hypercall. But then the bt command seems
to be operating on a completely different stack.
VCPU2 seems to be active, not polling (event processing could be initiated from
enable_irqs via the force callback, or before finally exiting the sched_op
hypercall); there seems to be a pending upcall, but upcalls are masked.
The next thing I would try is to instrument the sites incrementing and
decrementing irq_count...

---

Below is some of the info seen from the dom0 debugging keys and from looking
at a dump with crash:

Dom0 Info:

Polling vCPUs: {1,5-7}
VCPU0: CPU0 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU1: CPU1 [has=F] poll=10 upcall_pend = 00, upcall_mask = 00
VCPU2: CPU6 [has=T] poll=0  upcall_pend = 01, upcall_mask = 01
VCPU3: CPU2 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU4: CPU4 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU5: CPU3 [has=F] poll=0  upcall_pend = 00, upcall_mask = 00
VCPU6: CPU7 [has=F] poll=40 upcall_pend = 01, upcall_mask = 01
VCPU7: CPU4 [has=F] poll=46 upcall_pend = 01, upcall_mask = 01

10 [0/1]: s=6 n=1 x=0
40 [0/1]: s=6 n=6 x=0
46 [0/1]: s=6 n=7 x=0

Guest Info:

lock_spinners:
[0] (struct xen_spinlock *) 0x
[1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
[2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
[3] (struct xen_spinlock *) 0x
[4] (struct xen_spinlock *) 0x
[5] (struct xen_spinlock *) 0xffff8803bffc8ae8 -> [1]
[6] (struct xen_spinlock *) 0xffff8803bfc93700 -> [2]
[7] (struct xen_spinlock *) 0xffffffff81f15ef0 = { lock=1, spinners=1 }

VCPU[2,6] -> lock of runqueue[2] = (struct rq *) 0xffff8803bfc93700

irq_count[1] = 1
irq_count[2] = 2
It is -1 for the rest.
*(struct pt_regs *) irq_regs[2] = {
    .ip = 0xffffffff810013aa == hypercall_page+938 -> sched_op
    .sp = 0xffff8803bfc83ce8
}

char *irq_stack_ptr[2] = 0xffff8803bfc83fc0

#> bt -T ffff88033d65c4a0
PID: 2050  TASK: ffff88033d65c4a0  CPU: 2  COMMAND: "postgres"
  [ffff88033d72d530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033d72d620] check_preempt_curr at ffffffff81050254
  [ffff88033d72d640] pull_task at ffffffff8105e33e
  [ffff88033d72d670] balance_tasks at ffffffff8105e4b6
  [ffff88033d72d680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033d72d690] find_get_page at ffffffff811161ee
  [ffff88033d72d6b0] find_lock_page at ffffffff81116286
  [ffff88033d72d6e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033d72d748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033d72d7a0] get_page_from_freelist at ffffffff8111e6df
  [ffff88033d72d7c0] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d7d0] ext4_ext_check_cache at ffffffff81239248
  [ffff88033d72d820] ext4_ext_map_blocks at ffffffff8123e269
  [ffff88033d72d870] _raw_spin_lock at ffffffff81655e4e
  [ffff88033d72d880] enqueue_to_backlog at ffffffff8153e55f
  [ffff88033d72d890] __wake_up_common at ffffffff8104c1c8
  [ffff88033d72d8e0] netif_rx.part.82 at ffffffff8153f2de
  [ffff88033d72d920] netif_rx at ffffffff8153f400
  [ffff88033d72d960] loopback_xmit at ffffffff8146fb1c
  [ffff88033d72d990] dev_hard_start_xmit at ffffffff8153fea6
  [ffff88033d72d9d0] do_softirq at ffffffff81016284
  [ffff88033d72d9f0] local_bh_enable at ffffffff8106d614
  [ffff88033d72da00] dev_queue_xmit at ffffffff81540309
  [ffff88033d72da50] ip_finish_output at ffffffff815769ab
  [ffff88033d72da90] ip_output at ffffffff81577518
  [ffff88033d72dac0] ip_local_out at ffffffff81576c09
  [ffff88033d72dae0] ip_send_skb at ffffffff81577f4b
  [ffff88033d72db00] udp_send_skb at ffffffff8159aaa1
  [ffff88033d72db40] ip_generic_getfrag at ffffffff81575030
  [ffff88033d72db50] udp_sendmsg at ffffffff8159bd38
  [ffff88033d72db90] irq_to_desc at ffffffff810d70b7
  [ffff88033d72dbd0] notify_remote_via_irq at ffffffff813a74c1
  [ffff88033d72dc10] ttwu_queue at ffffffff81052672
  [ffff88033d72dc38] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033d72dc90] inet_sendmsg at ffffffff815a6474
  --- bt -t would end here
  [ffff88033d72dcb0] apparmor_socket_sendmsg at ffffffff812d43f7
  [ffff88033d72dcd0] sock_sendmsg at ffffffff815266fe
  [ffff88033d72dd70] md_make_request at ffffffff814e0606
  [ffff88033d72ddd0] kmem_cache_free at ffffffff811605cf
  [ffff88033d72de10] mempool_free_slab at ffffffff81118507
  [ffff88033d72de20] mempool_free at ffffffff811185b7
  [ffff88033d72de50] sys_sendto at ffffffff8152974d
  [ffff88033d72dee0] ext4_sync_file at ffffffff8120f07b
  [ffff88033d72df00] vfs_write at ffffffff811764b0
  [ffff88033d72df50] xen_hypervisor_callback at ffffffff816606db
  RIP: 00007f6852c7b39a  RSP: 00007fff431454e0  RFLAGS: 00000212
  RAX: 00007f6852fb38b8  RBX: 00007f6852fb3720  RCX: 000000000000014a
  RDX: 0000000000000150  RSI: 0000000000000140  RDI: 00007f6852fb3720
  RBP: 0000000000000150  R8:  0000000000000015  R9:  0000000000000000
  R10: 0000000000000000  R11: 00007f6852c9174a  R12: 00007f6852fb3778
  R13: 0000000000000140  R14: 00007f6852fb38b8  R15: 0000000000000001
  ORIG_RAX: ffffffffffffffff  CS: e033  SS: e02b

#> bt -T ffff88033b812dc0
PID: 2069  TASK: ffff88033b812dc0  CPU: 6  COMMAND: "postgres"
  [ffff88033b897530] get_page_from_freelist at ffffffff8111e6df
  [ffff88033b897600] __alloc_pages_nodemask at ffffffff8111ebec
  [ffff88033b897640] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897650] irq_to_desc at ffffffff810d70b7
  [ffff88033b897660] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897670] balance_tasks at ffffffff8105e493
  [ffff88033b897680] radix_tree_lookup_slot at ffffffff8130db5e
  [ffff88033b897690] find_get_page at ffffffff811161ee
  [ffff88033b8976b0] find_lock_page at ffffffff81116286
  [ffff88033b8976e0] shmem_getpage_gfp at ffffffff8112dd10
  [ffff88033b897748] pte_pfn_to_mfn at ffffffff81005a78
  [ffff88033b897770] xen_set_pte_at at ffffffff81008e29
  [ffff88033b897788] __raw_callee_save_xen_make_pte at ffffffff810052cd
  [ffff88033b8977b0] unlock_page at ffffffff8111589a
  [ffff88033b8977d0] __do_fault at ffffffff81138ac9
  [ffff88033b8978a0] pvclock_clocksource_read at ffffffff8103dd15
  [ffff88033b8978f0] xen_clocksource_read at ffffffff8100a850
  [ffff88033b897900] sched_clock at ffffffff8101bd79
  [ffff88033b897910] blk_rq_init at ffffffff812ec262
  [ffff88033b897930] get_request at ffffffff812f004e
  [ffff88033b8979c0] cpumask_next_and at ffffffff81308c66
  [ffff88033b8979e0] find_busiest_group at ffffffff8105a061
  --- bt -t ends here
  [ffff88033b897a30] radix_tree_lookup at ffffffff8130db6b
  [ffff88033b897a40] irq_to_desc at ffffffff810d70b7
  [ffff88033b897a50] irq_get_irq_data at ffffffff810d9f4e
  [ffff88033b897a60] info_for_irq at ffffffff813a636e
  [ffff88033b897a80] xen_poll_irq_timeout at ffffffff813a696e
  [ffff88033b897ac0] xen_poll_irq at ffffffff813a8450
  [ffff88033b897ad0] xen_spin_lock_slow at ffffffff8163ad77
  [ffff88033b897b20] xen_spin_lock at ffffffff810121da
  [ffff88033b897b40] _raw_spin_lock at ffffffff81655e4e
  [ffff88033b897b50] double_rq_lock at ffffffff8105119c
  [ffff88033b897b80] load_balance at ffffffff8105e788
  [ffff88033b897b88] _raw_spin_unlock_irqrestore at ffffffff8165605e
  [ffff88033b897c10] idle_balance at ffffffff8163d57c
  [ffff88033b897c60] __schedule at ffffffff81653ea9
  [ffff88033b897cc0] sleep_on_page at ffffffff81115810
  [ffff88033b897ce0] schedule at ffffffff81653fcf
  [ffff88033b897cf0] io_schedule at ffffffff8165407f
  [ffff88033b897d10] sleep_on_page at ffffffff8111581e
  [ffff88033b897d20] __wait_on_bit at ffffffff8165489f
  [ffff88033b897d70] wait_on_page_bit at ffffffff81115988
  [ffff88033b897da0] wake_bit_function at ffffffff8108a140
  [ffff88033b897dc0] filemap_fdatawait_range at ffffffff81115a9c
  [ffff88033b897e60] do_writepages at ffffffff81120ee1
  [ffff88033b897eb0] filemap_write_and_wait_range at ffffffff81117398
  [ffff88033b897ee0] ext4_sync_file at ffffffff8120ef9f
  [ffff88033b897f00] vfs_write at ffffffff811764b0
  [ffff88033b897f40] do_fsync at ffffffff811a4a36
  [ffff88033b897f70] sys_fdatasync at ffffffff811a4d83
  [ffff88033b897f80] system_call_fastpath at ffffffff8165e442
  RIP: 00007f6852ce8240  RSP: 00007fff43145618  RFLAGS: 00000246
  RAX: 000000000000004b  RBX: ffffffff8165e442  RCX: 00007f6852ce1900
  RDX: 00000000000000c5  RSI: 000000000000000c  RDI: 000000000000000c
  RBP: 00000000000000c5  R8:  000000000073a550  R9:  0000000000000000
  R10: 0000000000000004  R11: 0000000000000246  R12: ffffffff811a4d83
  R13: ffff88033b897f78  R14: 0000000000000001  R15: 0000000000af7488
  ORIG_RAX: 000000000000004b  CS: e033  SS: e02b
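For reference, the lock_spinners dump above is a per-VCPU pointer into the old
byte-lock based Xen PV spinlock implementation. A rough sketch of the structures
involved, paraphrased from memory of the v3.2-era arch/x86/xen/spinlock.c (field
layout approximate, not a verbatim copy):

    /* Byte lock plus a count of CPUs currently spinning on it. */
    struct xen_spinlock {
            unsigned char lock;             /* 0 -> free, 1 -> taken */
            unsigned short spinners;        /* count of waiting CPUs */
    };

    /*
     * Per-VCPU: the lock this CPU is currently blocking on inside
     * xen_spin_lock_slow(); this is what the "lock_spinners" dump above
     * shows for each VCPU.
     */
    static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

    /* Per-VCPU event channel ("kicker irq") that is polled while waiting. */
    static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;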
On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> A while ago I reported an issue that causes Xen PVM guests to hang
> (apparently related to spinlocks). [...]
>
> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.

This is on a host with >= 8 PCPUs, correct?

> - When hanging, at least 2 VCPUs sit in xen_spin_lock_slow with the lock free.
>   IIRC the weird lock always belongs to one VCPU's runqueue.
> - The testcase shows the problem for guest kernel versions v3.2..v3.5 (not
>   verified further back). Since v3.6 it cannot be reproduced. A bisect
>   ends up with: 611ae8e3f5204f7480b3b405993b3352cfa16662
>   "x86/tlb: enable tlb flush range support for x86"

That seems like a very odd candidate for impacting on this issue.

> - The other potential fix for this was to prevent xen_spin_lock_slow()
>   from unmasking events (enabling interrupts) if those were enabled
>   before the spinlock call.
>
> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
> manages to tip the likelihood of getting nested deeper.

Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
which seems to be a rather complex dance judging from the comments in
that function (and one I confess I don't really follow).

> The same way, the changed TLB flush may reduce it by causing fewer IPIs.

Ah, that might explain why that patch is relevant, yes.

> From the static information I can extract from a dump it is hard to
> tell what exactly is going on. [...]
> The next thing I would try is to instrument the sites incrementing and
> decrementing irq_count...
>
> [...]
>
> lock_spinners:
> [0] (struct xen_spinlock *) 0x
> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }

If I understand correctly this means that 2 threads (VCPU2 & 6, I
suppose) are spinning on this lock, but no one is holding it?

An interesting hack^Wexperiment might be to make xen_poll_irq use a
timeout and see if that unwedges things -- this would help confirm that
the issue is on nested wakeup.

I suppose xen_spin_unlock could also be instrumented to indicate who it
last kicked and for which lock and that might show us something?

Can someone explain why the non-locked update of lock_spinners in
spinning_lock() is safe against xen_spin_unlock reading the field from
another VCPU? I suspect it isn't, so what happens if the other VCPU
kicks the lock which is just about to become prev? Maybe this is handled
in xen_spin_lock_slow somehow?

In a similar vein do we not need a barrier in unspinning_lock to ensure
that the remote cpu sees prev getting restored?
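For readers without the source at hand, the two helpers Ian asks about implement
the nesting ("prev") dance. A simplified sketch, paraphrased from the v3.2-era
code (statistics trimmed, details approximate):

    /*
     * Announce that this CPU blocks on xl; remember what it was blocking
     * on before, in case the slow path was entered from an upcall that
     * interrupted an outer slow-path spin.
     */
    static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
    {
            struct xen_spinlock *prev;

            prev = __this_cpu_read(lock_spinners);  /* outer lock, if any */
            __this_cpu_write(lock_spinners, xl);

            wmb();          /* publish lock of interest before the count */

            asm(LOCK_PREFIX " incw %0" : "+m" (xl->spinners) : : "memory");

            return prev;
    }

    /* Undo the above and restore the outer ("prev") lock pointer. */
    static void unspinning_lock(struct xen_spinlock *xl,
                                struct xen_spinlock *prev)
    {
            asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
            wmb();          /* drop the count before restoring prev */
            __this_cpu_write(lock_spinners, prev);
    }

The unlocked __this_cpu_write() to a field that xen_spin_unlock_slow() reads
from another CPU is the race Ian is asking about.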
On 11.02.2013 18:29, Ian Campbell wrote:
> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>> A while ago I reported an issue that causes Xen PVM guests to hang
>> (apparently related to spinlocks). [...]
>>
>> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>
> This is on a host with >= 8 PCPUs, correct?

Maybe >=4 but I cannot remember for sure anymore.

>> [...]
>>   "x86/tlb: enable tlb flush range support for x86"
>
> That seems like a very odd candidate for impacting on this issue.
>
>> - The other potential fix for this was to prevent xen_spin_lock_slow()
>>   from unmasking events (enabling interrupts) if those were enabled
>>   before the spinlock call.
>>
>> Maybe allowing or not allowing interrupts in xen_spin_lock_slow just
>> manages to tip the likelihood of getting nested deeper.
>
> Either that or the bug is in the nesting aspect of xen_spin_lock_slow(),
> which seems to be a rather complex dance judging from the comments in
> that function (and one I confess I don't really follow).

I am suspecting something like this. Or maybe generally a nesting issue
revealed by enabling interrupts here.

>> The same way, the changed TLB flush may reduce it by causing fewer IPIs.
>
> Ah, that might explain why that patch is relevant, yes.
>
>> [...]
>>
>> lock_spinners:
>> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
>> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
>
> If I understand correctly this means that 2 threads (VCPU2 & 6, I
> suppose) are spinning on this lock, but no one is holding it?

To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
after receiving the wakeup irq but before returning from the hv call. So the
VCPU that had the lock has returned it and likely the wakeup has been sent.

> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.

I could do that. This would at least re-activate the other waiters. In the case
shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
this would move things ahead.

> I suppose xen_spin_unlock could also be instrumented to indicate who it
> last kicked and for which lock and that might show us something?

IIRC I had done this but it did not really show much. What I have done in the
meantime was to instrument the IRQ service functions in
arch/x86/kernel/entry_64.S to give me a history of callbacks. This shows (in
another experiment where it is 1,2,5,6 in the same lock and 2,5,6 still polling)
that the last events on the vcpu not polling are:

  xen_do_hypervisor_callback+127
  call_softirq+29
  call_softirq+125
  call_softirq+29
  call_softirq+125
  call_softirq+29
  call_softirq+125
  xen_do_hypervisor_callback+28
  call_softirq+29
  xen_do_hypervisor_callback+28

The lower offsets are where irq_count gets incremented and the higher offsets
are where irq_count gets decremented before ending the callback. This shows that
at least in this case there was an upcall, a softirq and another upcall being
triggered without finishing the previous one. There was another experiment where
I saw softirq, upcall, upcall. But at least it seems that there is always a
three-level stack.
I believe that softirq, upcall would be ok as softirq processing enables
interrupts, but is it expected to have an upcall interrupted by another one?

> Can someone explain why the non-locked update of lock_spinners in
> spinning_lock() is safe against xen_spin_unlock reading the field from
> another VCPU? I suspect it isn't, so what happens if the other VCPU
> kicks the lock which is just about to become prev? Maybe this is handled
> in xen_spin_lock_slow somehow?

Isn't that safe because lock_spinners is percpu?

> In a similar vein do we not need a barrier in unspinning_lock to ensure
> that the remote cpu sees prev getting restored?
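A C-level equivalent of the entry_64.S instrumentation Stefan describes would be
a small per-CPU ring buffer recording the instruction pointer at every site that
bumps or drops irq_count. This is a hypothetical sketch, not the actual patch:

    #include <linux/percpu.h>

    #define CB_HIST_LEN 16

    /* Hypothetical debug aid: per-CPU ring of instruction pointers seen at
     * the irq_count increment/decrement sites. */
    struct cb_hist {
            unsigned long ip[CB_HIST_LEN];
            unsigned int idx;
    };
    static DEFINE_PER_CPU(struct cb_hist, cb_hist);

    /* Call next to each irq_count update, passing _THIS_IP_. */
    static inline void cb_hist_record(unsigned long ip)
    {
            struct cb_hist *h = this_cpu_ptr(&cb_hist);

            h->ip[h->idx++ % CB_HIST_LEN] = ip;
    }

The per-CPU buffers can then be dug out of a crash dump, much like the
per-VCPU history Stefan quotes above.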
On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
> On 11.02.2013 18:29, Ian Campbell wrote:
> > On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
> >> [...]
> >> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
> >> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
> >
> > This is on a host with >= 8 PCPUs, correct?
>
> Maybe >=4 but I cannot remember for sure anymore.

OK, so the important point I was getting at was whether the guest was
overcommitting the host or not, it seems like it is (2xVCPUs for each
PCPU)

> >> [...]
> >>
> >> lock_spinners:
> >> [1] (struct xen_spinlock *) 0xffff8803bffc8ae8 = { lock=1, spinners=2 }
> >> [2] (struct xen_spinlock *) 0xffff8803bfc93700 = { lock=0, spinners=2 }
> >
> > If I understand correctly this means that 2 threads (VCPU2 & 6, I
> > suppose) are spinning on this lock, but no one is holding it?
>
> To my reading both went into xen_spinlock_slow(). VCPU2 may be either in the
> raw_local_irq_enable() before doing poll_irq, or maybe in handling some event
> after receiving the wakeup irq but before returning from the hv call. So the
> VCPU that had the lock has returned it and likely the wakeup has been sent.

Makes sense, so we are in the window between one holder unlocking and
the next locking, which explains why noone is currently holding it
(which I previously thought odd).

> > An interesting hack^Wexperiment might be to make xen_poll_irq use a
> > timeout and see if that unwedges things -- this would help confirm that
> > the issue is on nested wakeup.
>
> I could do that. This would at least re-activate the other waiters. In the case
> shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
> this would move things ahead.

Yes, it would be interesting e.g. if only either 6 or 2 made progress
and the other remained wedged for all time.

> > I suppose xen_spin_unlock could also be instrumented to indicate who it
> > last kicked and for which lock and that might show us something?
>
> IIRC I had done this but it did not really show much. What I have done in the
> meantime was to instrument the IRQ service functions in
> arch/x86/kernel/entry_64.S to give me a history of callbacks. [...]
> This shows that at least in this case there was an upcall, a softirq and
> another upcall being triggered without finishing the previous one. [...]
> I believe that softirq, upcall would be ok as softirq processing enables
> interrupts, but is it expected to have an upcall interrupted by another one?

I'm not sure. evtchn's don't have a priority mechanism so I expect not
in general but individual interrupt handlers could well reenable
interrupts, I think (it might be one of the flags you pass when calling
request_irq?).

> > Can someone explain why the non-locked update of lock_spinners in
> > spinning_lock() is safe against xen_spin_unlock reading the field from
> > another VCPU? I suspect it isn't, so what happens if the other VCPU
> > kicks the lock which is just about to become prev? Maybe this is handled
> > in xen_spin_lock_slow somehow?
>
> Isn't that safe because lock_spinners is percpu?

xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
is what I think might be unsafe.

Ian.
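The remote access Ian means is the scan in the unlock slow path; roughly (again
a paraphrased sketch of the v3.2-era code, the exact kick mechanism quoted from
memory):

    /*
     * Unlock slow path: look for a CPU that advertised, via its
     * lock_spinners pointer, that it is blocking on this lock, and wake
     * its poller by notifying that CPU's lock_kicker_irq event channel.
     */
    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    /*
                     * Unlocked read of another CPU's lock_spinners, racing
                     * with spinning_lock()/unspinning_lock() over there.
                     */
                    if (per_cpu(lock_spinners, cpu) == xl)
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
            }
    }

This is also the loop Stefan notes further down always starts its search at
vcpu 0.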
On 12.02.2013 15:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 13:42 +0000, Stefan Bader wrote:
>> On 11.02.2013 18:29, Ian Campbell wrote:
>>> On Fri, 2013-02-08 at 15:33 +0000, Stefan Bader wrote:
>>>> [...]
>>>> - Happens with Xen 4.1.2 and Xen 4.2 on the host side.
>>>> - PVM guest with 8 VCPUs is running 800 threads doing DB updates.
>>>
>>> This is on a host with >= 8 PCPUs, correct?
>>
>> Maybe >=4 but I cannot remember for sure anymore.
>
> OK, so the important point I was getting at was whether the guest was
> overcommitting the host or not, it seems like it is (2xVCPUs for each
> PCPU)

Err not so much. I run on an 8-core host. ;)

> [...]
>
>>> An interesting hack^Wexperiment might be to make xen_poll_irq use a
>>> timeout and see if that unwedges things -- this would help confirm that
>>> the issue is on nested wakeup.
>>
>> I could do that. This would at least re-activate the other waiters. In the case
>> shown VCPU6, and if something in the path of execution on VCPU2 deadlocks there,
>> this would move things ahead.
>
> Yes, it would be interesting e.g. if only either 6 or 2 made progress
> and the other remained wedged for all time.

On my list next...

> [...]
>
> I'm not sure. evtchn's don't have a priority mechanism so I expect not
> in general but individual interrupt handlers could well reenable
> interrupts, I think (it might be one of the flags you pass when calling
> request_irq?).

Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
handler. I thought anything that needs interrupts enabled and/or takes longer
should be pushed off to a workqueue...

>>> Can someone explain why the non-locked update of lock_spinners in
>>> spinning_lock() is safe against xen_spin_unlock reading the field from
>>> another VCPU? I suspect it isn't, so what happens if the other VCPU
>>> kicks the lock which is just about to become prev? Maybe this is handled
>>> in xen_spin_lock_slow somehow?
>>
>> Isn't that safe because lock_spinners is percpu?
>
> xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> is what I think might be unsafe.

Ah right, that. Hm, the way those two play together always was a bit
brain-hurting. Though somehow it feels like the top-level poll_irq would
return and count things as a spurious wakeup. I think in that case I would
expect the lock_spinners entry of one vcpu to not match the freed lock...
Not a really scientific argument.
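As an aside, the deferral pattern Stefan refers to is the usual one; a minimal,
self-contained sketch with hypothetical names (my_irq_handler and
my_slow_work_fn are not from this thread):

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static void my_slow_work_fn(struct work_struct *work)
    {
            /* Process context: may sleep, runs with interrupts enabled. */
    }
    static DECLARE_WORK(my_slow_work, my_slow_work_fn);

    static irqreturn_t my_irq_handler(int irq, void *dev_id)
    {
            /* Hard-irq context: do the minimum and defer the rest. */
            schedule_work(&my_slow_work);
            return IRQ_HANDLED;
    }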
On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
> On 12.02.2013 15:07, Ian Campbell wrote:
> > [...]
> > OK, so the important point I was getting at was whether the guest was
> > overcommitting the host or not, it seems like it is (2xVCPUs for each
> > PCPU)
>
> Err not so much. I run on an 8-core host. ;)

Um, which is what I asked. What is >= 4 then?

> > [...]
> > I'm not sure. evtchn's don't have a priority mechanism so I expect not
> > in general but individual interrupt handlers could well reenable
> > interrupts, I think (it might be one of the flags you pass when calling
> > request_irq?).
>
> Just a gut feeling, but it feels wrong to enable interrupts in any interrupt
> handler. I thought anything that needs interrupts enabled and/or takes longer
> should be pushed off to a workqueue...

Either that or some sort of interrupt prioritisation mechanism in the
hardware so only higher priority interrupts than the current one will
occur.

> > > > Can someone explain why the non-locked update of lock_spinners in
> > > > spinning_lock() is safe against xen_spin_unlock reading the field from
> > > > another VCPU? [...]
> > >
> > > Isn't that safe because lock_spinners is percpu?
> >
> > xen_spin_unlock_slow() accesses the lock_spinners of remote CPUs, which
> > is what I think might be unsafe.
>
> Ah right, that. Hm, the way those two play together always was a bit
> brain-hurting.

Yes...

> Though somehow it feels like the top-level poll_irq would
> return and count things as a spurious wakeup. I think in that case I would
> expect the lock_spinners entry of one vcpu to not match the freed lock...
> Not a really scientific argument.

Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
unspinning_lock might be interesting?

I wonder if this is something where ftrace and the related kernel
infrastructure might be useful?

Ian.
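Spelled out, the check Ian suggests would sit at the top of unspinning_lock(),
for example like this (a debugging sketch only, building on the earlier
paraphrase of that helper; trace_printk() is the ftrace facility he mentions):

    static void unspinning_lock(struct xen_spinlock *xl,
                                struct xen_spinlock *prev)
    {
            /*
             * We should only ever un-announce the lock we announced last;
             * anything else means the prev stacking got corrupted.
             */
            BUG_ON(__this_cpu_read(lock_spinners) != xl);

            /* Cheap breadcrumb in the ftrace ring buffer, which can also be
             * fished out of a crash dump. */
            trace_printk("unspin %p -> prev %p\n", xl, prev);

            asm(LOCK_PREFIX " decw %0" : "+m" (xl->spinners) : : "memory");
            wmb();
            __this_cpu_write(lock_spinners, prev);
    }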
On 12.02.2013 16:07, Ian Campbell wrote:
> On Tue, 2013-02-12 at 14:50 +0000, Stefan Bader wrote:
>> On 12.02.2013 15:07, Ian Campbell wrote:
>>> [...]
>>> OK, so the important point I was getting at was whether the guest was
>>> overcommitting the host or not, it seems like it is (2xVCPUs for each
>>> PCPU)
>>
>> Err not so much. I run on an 8-core host. ;)
>
> Um, which is what I asked. What is >= 4 then?

Uhm, me not being able to discern between P and V at some point. So what I was
thinking was that I believe I had this when running with 4 VCPUs or 8 VCPUs on
the same 8-PCPU host...

> [...]
>
>> Though somehow it feels like the top-level poll_irq would
>> return and count things as a spurious wakeup. I think in that case I would
>> expect the lock_spinners entry of one vcpu to not match the freed lock...
>> Not a really scientific argument.
>
> Adding a BUG_ON(__this_cpu_read(lock_spinners) != xl) in
> unspinning_lock might be interesting?

Hm, besides writing prev into there (which always should not be xl), I think
that since write access only happens in spin_lock_slow this should stack ok.
The two cases that would go wrong are sending a wakeup event when the lock was
just replaced on the other vcpu, or not sending one when it should be.

IIRC I had one attempt print when xen_spin_unlock_slow went through all vcpus
without finding one to signal. That seemed to happen rather often, but then
would resolve itself as xen_spin_lock_slow for the other lock would set the
event pending for the prev lock if there was one (after being successful with
the top-level lock).

I think there is room for improvement by finding a way not to always search
from vcpu 0 in spin_unlock_slow, and maybe in spin_lock_slow to check whether
returning from poll without the current lock being free may be because prev is
free now. When there is more than one spinner, one of them may be able to
proceed...

> I wonder if this is something where ftrace and the related kernel
> infrastructure might be useful?

Maybe. Not sure there are hooks for what we are looking for, and whether the
system is not hanging too badly for it. My usual approach of looking at the
dump would require manually finding the buffer and parsing the records...
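The first improvement Stefan mentions, not always starting the unlock scan at
vcpu 0, could look something like the following purely hypothetical sketch;
kick_cpu() stands in for whatever notification the real code sends, and
contiguous CPU numbering is assumed:

    static void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int start = smp_processor_id();
            int ncpus = num_possible_cpus();
            int i;

            /* Start just after the unlocking CPU and wrap around, so the
             * low-numbered VCPUs are not always the first to be kicked. */
            for (i = 1; i <= ncpus; i++) {
                    int cpu = (start + i) % ncpus;

                    if (cpu_online(cpu) &&
                        per_cpu(lock_spinners, cpu) == xl) {
                            kick_cpu(cpu);  /* hypothetical helper */
                            break;
                    }
            }
    }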
On 11.02.2013 18:29, Ian Campbell wrote:
> An interesting hack^Wexperiment might be to make xen_poll_irq use a
> timeout and see if that unwedges things -- this would help confirm that
> the issue is on nested wakeup.
>

So I went ahead and replaced xen_poll_irq with xen_poll_irq_timeout, and that
did get rid of the hang. Though I think there is a big caveat there. There is
only one other user of poll_irq_timeout in the kernel code, and that one uses
"jiffies + <timeout>*HZ". But when I look at the Xen side in do_poll, it looks
like the timeout is used as an absolute "ns since boot" (of hv/dom0) value.
Not sure how that can ever work. The ns since boot in the guest is clearly
always behind the host (and jiffies is not ns either).

Effectively I likely got rid of any wait time in the hypervisor and went back
to mostly spinning. Which matches the experience that the test run never gets
stuck waiting for a timeout. That maybe proves the stacking is an issue, but
it is also likely a bit too aggressive in not having any wait time at all... :/

I will try to think of some better way. I am not sure the following is
realistic, but maybe this could happen:

xen_spin_lock_slow(a)
...
enables irq and upcalls are pending
upcall processing wants lock b
xen_spin_lock_slow(b)
--- just before replacing lock_spinners ---
xen_spin_unlock_slow(a)
finds other vcpu, triggers IRQ
lock b is top spinner
going into poll_irq
poll_irq returns
lock a gets restored
so maybe no spinners on b
dropping out to xen_spin_lock
unlock of b not finding any spinners
lock b acquired

That way the IRQ for lock a may get lost...
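For illustration only, the timeout experiment described above boils down to
something like the line below (hypothetical timeout value, mirroring the other
in-tree caller); the unit mismatch is exactly why it degenerates into spinning:

    /*
     * Hypothetical replacement of the blocking poll in xen_spin_lock_slow().
     * Xen's do_poll() treats the timeout as an absolute "ns since (host) boot"
     * deadline, so a jiffies-based value is almost always already in the past
     * and the hypercall returns immediately, i.e. we are back to spinning.
     */
    xen_poll_irq_timeout(irq, (u64)jiffies + HZ);   /* was: xen_poll_irq(irq); */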
On 13.02.2013 12:31, Stefan Bader wrote:
> I will try to think of some better way. I am not sure the following is
> realistic, but maybe this could happen:
>
> xen_spin_lock_slow(a)
> ...
> enables irq and upcalls are pending
> upcall processing wants lock b
> xen_spin_lock_slow(b)
> --- just before replacing lock_spinners ---
> xen_spin_unlock_slow(a)
> finds other vcpu, triggers IRQ
> lock b is top spinner
> going into poll_irq
> poll_irq returns
> lock a gets restored
> so maybe no spinners on b
> dropping out to xen_spin_lock
> unlock of b not finding any spinners
> lock b acquired
>
> That way the IRQ for lock a may get lost...

Darn, maybe not, since the pending status is not cleared when leaving
xen_spin_lock_slow. So I assume the interrupted lock does at least one more
cycle out of the slow path and back...
I think I can finally prove why allowing interrupts for the duration of
poll_irq in xen_spin_lock_slow is bad! \o/

So with another run/dump: VCPUs 1, 2, 5 and 6 are spinning on the runq lock of
VCPU 1. The number of spinners is 4 (ok) and the lock is released. According
to the dom0 debug info, VCPUs 2, 5 and 6 are polling; VCPU 1 is not, but seems
to be running (has=T). The event channel for spinlock1 is pending.

Checking the interrupt stack of VCPU 1 in the guest dump shows:

hypercall_page+938
xen_poll_irq+16
xen_spin_lock_slow+151
xen_spin_lock_flags+99
_raw_spin_lock_irqsave+46
update_shares+156
rebalance_domains+72
run_rebalance_domains+72
__do_softirq+168
call_softirq+99
do_softirq+101
irq_exit+142
xen_evtchn_do_upcall+53
xen_do_hypervisor_callback+101

So right before finishing a previous callback, irq_exit triggers softirq
processing (interrupts enabled), and while updating the shares this tries to
grab the runq lock which we see in lock_spinners. Since irq_count is 2, there
is one more interruption executing right now (though irq_regs won't show
this). So I just proceeded upwards in the interrupt stack and found:

try_to_wake_up+57
default_wake_function+18
autoremove_wake_function+22
wake_bit_function+59
__wake_up_common+88
__wake_up+72
__wake_up_bit+49
end_page_writeback+64
put_io_page+36
ext4_end_bio+338
bio_endio+29
req_bio_endio.isra.45+163
blk_update_request+245
blk_update_bidi_request+49
__blk_end_bidi_request+32
__blk_end_request_all+31
blkif_interrupt+460
handle_irq_event_percpu+85
handle_irq_event+78
handle_edge_irq+132
__xen_evtchn_do_upcall+409
xen_evtchn_do_upcall+47
xen_do_hypervisor_callback+101

There was a bit more on the stack, but try_to_wake_up seemed a believable
current path. Even more so after looking into the function:

#ifdef CONFIG_SMP
        /*
         * If the owning (remote) cpu is still in the middle of schedule() with
         * this task as prev, wait until its done referencing the task.
         */
        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif

And prying the task in question out of the stack, it was one currently
accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters for
the runq lock of VCPU 1, which would get picked up as soon as this callback is
done. Unfortunately we "stay awhile... stay forever!".

Ok, as an experiment I defined __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86, and
running that kernel the PVM guest no longer locks up. The problem there is
that this define is gone since v3.7 (f3e9478674). And the same testcase cannot
trigger the issue since v3.6 (611ae8e3f5). The TLB flush change is unlikely to
be directly related (rather, it changes the number of IPIs and thereby the
likelihood of observing this).

So for v3.6+ kernels, the question is whether it can be ruled out that, during
softirq handling, rebalance_domains->update_blocked_averages (formerly
update_shares) is called, which can then get into spin_lock_slow and enable
interrupts. I have not seen it, nor am I aware of reports about it, but it is
likely one of those rare occurrences.

As for a solution, the simplest one would be never to re-enable interrupts
before poll_irq... Anything else feels rather complicated at the moment (like
trying to make the scheduler disable interrupts itself and use the plain
spin_lock variants instead of the spin_lock_irq ones... weird hidden
implications in common code "just" for PVM).

-Stefan

Note: comparing the number of DB records created in a 15min run of the
testcase seems to yield similar results with interrupts enabled (via that arch
define, in v3.2) or disabled. Though I do not have a large sample base, and
that PVM guest is the only one running on the host.
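To make the "simplest" option above concrete, here is a rough sketch of the
poll loop in xen_spin_lock_slow with the conditional interrupt re-enabling
dropped (reconstructed from memory of the v3.2-era code and simplified; not an
actual patch):

    do {
            xen_clear_irq_pending(irq);

            /* the lock may have become free while we were setting up */
            if (xen_spin_trylock(lock))
                    goto out;

            /*
             * Previously: if the caller had interrupts enabled (irq_enable),
             * they were re-enabled here around the poll. That is exactly the
             * window in which the upcall -> softirq -> rebalance_domains ->
             * runq-lock chain above was observed.
             */

            /* block with events still masked until the kicker IRQ fires */
            xen_poll_irq(irq);
    } while (!xen_test_irq_pending(irq));   /* tolerate spurious wakeups */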
>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
> There was a bit more on the stack, but try_to_wake_up seemed a believable
> current path. Even more so after looking into the function:
>
> #ifdef CONFIG_SMP
>         /*
>          * If the owning (remote) cpu is still in the middle of schedule() with
>          * this task as prev, wait until its done referencing the task.
>          */
>         while (p->on_cpu) {
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>                 /*
>                  * In case the architecture enables interrupts in
>                  * context_switch(), we cannot busy wait, since that
>                  * would lead to deadlocks when an interrupt hits and
>                  * tries to wake up @prev. So bail and do a complete
>                  * remote wakeup.
>                  */
>                 if (ttwu_activate_remote(p, wake_flags))
>                         goto stat;
> #else
>                 cpu_relax();
> #endif
>
> And prying the task in question out of the stack, it was one currently
> accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters
> for the runq lock of VCPU 1, which would get picked up as soon as this
> callback is done. Unfortunately we "stay awhile... stay forever!".

When analyzing a similar problem with our old PV ticket lock implementation
(and the interrupt re-enabling there when possible before going into polling
mode), it was precisely this open-coded lock construct that caused problems
for us. Back then I didn't, however, realize that this would even affect the
simplistic byte locks used by the pv-ops Xen code (or else I would have
pointed this out).

Being relatively certain that with our new implementation we don't have any
such problems, I can only recommend against dropping the re-enabling of
interrupts - this needlessly introduces high interrupt latencies in a broader
range of cases. Instead, the interactions ought to be fixed properly. And it's
time for using ticket locks in the Xen code too...

Jan
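For context, a generic illustration (not the Xen or Linux implementation) of
what the ticket locks mentioned above change compared to a byte lock: the next
owner is fixed by ticket order, so the unlock path never has to guess which
waiter to wake:

    #include <stdatomic.h>

    /* Generic ticket lock sketch, for illustration only. */
    struct ticket_lock {
            atomic_uint next;       /* next ticket to hand out */
            atomic_uint owner;      /* ticket currently allowed to enter */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
            unsigned int t = atomic_fetch_add(&l->next, 1);

            while (atomic_load(&l->owner) != t)
                    ;               /* a PV variant would poll/block here */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
            /* hand the lock to the holder of the next ticket, in order */
            atomic_fetch_add(&l->owner, 1);
    }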
On 14.02.2013 12:43, Jan Beulich wrote:
>>>> On 14.02.13 at 12:04, Stefan Bader <stefan.bader@canonical.com> wrote:
>> There was a bit more on the stack, but try_to_wake_up seemed a believable
>> current path. Even more so after looking into the function:
>>
>> [...]
>>
>> And prying the task in question out of the stack, it was one currently
>> accounted to VCPU 6 and marked on_cpu. VCPU 6 is one of the other waiters
>> for the runq lock of VCPU 1, which would get picked up as soon as this
>> callback is done. Unfortunately we "stay awhile... stay forever!".
>
> When analyzing a similar problem with our old PV ticket lock
> implementation (and the interrupt re-enabling there when possible
> before going into polling mode), it was precisely this open-coded
> lock construct that caused problems for us. Back then I didn't,
> however, realize that this would even affect the simplistic byte
> locks used by the pv-ops Xen code (or else I would have pointed
> this out).
>
> Being relatively certain that with our new implementation we don't
> have any such problems, I can only recommend against dropping
> the re-enabling of interrupts - this needlessly introduces high
> interrupt latencies in a broader range of cases. Instead, the
> interactions ought to be fixed properly. And it's time for using
> ticket locks in the Xen code too...

Not sure what *your new* implementation is. ;)

I am more concerned about the currently released/stable kernels which
potentially have this problem. I agree that a proper solution is preferable.
Ticket locks likely have a much lower chance of hitting this, since part of
the current problem is that lower-numbered VCPUs are preferred in unlock_slow.
In the end, though, we would need something that can go upstream (and from
there back into stable), so it should not be too complicated. A proper
solution can then be applied later.

Having more understanding now, I had a different idea. I am not sure this is
foolproof, and it surely causes some more active spinning. But it looks like I
can also keep the system from locking up (v3.2 kernel and the testcase) if I
change xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners instead
of only the first one found.

-Stefan
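The experiment amounts to roughly the following in xen_spin_unlock_slow (a
simplified sketch from memory of the v3.2-era code; stats macros omitted and
details may differ):

    static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    if (per_cpu(lock_spinners, cpu) == xl) {
                            /* kick this waiter ... */
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            /*
                             * ... and, unlike before, keep going instead of
                             * breaking out after the first match.
                             */
                    }
            }
    }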
>>> On 14.02.13 at 15:21, Stefan Bader <stefan.bader@canonical.com> wrote:
> Having more understanding now, I had a different idea. I am not sure this is
> foolproof, and it surely causes some more active spinning. But it looks like
> I can also keep the system from locking up (v3.2 kernel and the testcase) if
> I change xen_spin_unlock_slow to send the wakeup IPI to _all_ spinners
> instead of only the first one found.

That, afaict, is an absolute requirement with the open-coded lock construct
that you pointed at, as otherwise the code makes a pseudo-ticket assumption in
believing it knows who is going to be the next owner of the lock.

Jan