Xen: 4.1.3-rc1-pre (xenbits @ 23285)
Dom0: 3.2.6 PAE and 3.3.4 PAE

We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove loop
deployed to a few dozen boxes will hit it quite reliably within a short time.
This happens on both an older LVM userspace and the newest, and in production
we have seen this hit on lvremove, lvrename, and lvdelete.

#!/bin/bash
while true; do
    lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
done

BUG: unable to handle kernel paging request at bffff628
IP: [<c10ebc58>] __page_check_address+0xb8/0x170
*pdpt = 0000000003cfb027 *pde = 0000000013873067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 27902, comm: lvremove Not tainted 3.2.6-1 #1 Supermicro X8DT6/X8DT6
EIP: 0061:[<c10ebc58>] EFLAGS: 00010246 CPU: 6
EIP is at __page_check_address+0xb8/0x170
EAX: bffff000 EBX: cbf76dd8 ECX: 00000000 EDX: 00000000
ESI: bffff628 EDI: e49ed900 EBP: c80ffe60 ESP: c80ffe4c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process lvremove (pid: 27902, ti=c80fe000 task=d29adca0 task.ti=c80fe000)
Stack:
 e4205000 00000fff da9b6bc0 d0068dc0 e49ed900 c80ffe94 c10ec769 c80ffe84
 00000000 00000129 00000125 b76c5000 00000001 00000000 d0068c08 d0068dc0
 b76c5000 e49ed900 c80fff24 c10ecb73 00000002 00000005 35448025 c80ffec4
Call Trace:
 [<c10ec769>] try_to_unmap_one+0x29/0x310
 [<c10ecb73>] try_to_unmap_file+0x83/0x560
 [<c1005829>] ? xen_pte_val+0xb9/0x140
 [<c1004116>] ? __raw_callee_save_xen_pte_val+0x6/0x8
 [<c10e1bf8>] ? vm_normal_page+0x28/0xc0
 [<c1038e95>] ? kmap_atomic_prot+0x45/0x110
 [<c10ed13c>] try_to_munlock+0x1c/0x40
 [<c10e7109>] munlock_vma_page+0x49/0x90
 [<c10e7247>] munlock_vma_pages_range+0x57/0xa0
 [<c10e7352>] mlock_fixup+0xc2/0x130
 [<c10e742c>] do_mlockall+0x6c/0x80
 [<c10e7469>] sys_munlockall+0x29/0x50
 [<c166f1d8>] sysenter_do_call+0x12/0x28
Code: ff c1 ee 09 81 e6 f8 0f 00 00 81 e1 ff 0f 00 00 0f ac ca 0c c1 e2 05 03 55 ec 89 d0 e8 12 d3 f4 ff 8b 4d 0c 85 c9 8d 34 30 75 0c <f7> 06 01 01 00 00 0f 84 84 00 00 00 8b 0d 00 0e 9b c1 89 4d f0
EIP: [<c10ebc58>] __page_check_address+0xb8/0x170 SS:ESP 0069:c80ffe4c
CR2: 00000000bffff628
---[ end trace 8039aeca9c19f5ab ]---
note: lvremove[27902] exited with preempt_count 1
BUG: scheduling while atomic: lvremove/27902/0x00000001
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 27902, comm: lvremove Tainted: G      D 3.2.6-1 #1
Call Trace:
 [<c1040fcd>] __schedule_bug+0x5d/0x70
 [<c1666fb9>] __schedule+0x679/0x830
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c10a05fc>] ? rcu_enter_nohz+0x3c/0x60
 [<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
 [<c1001227>] ? hypercall_page+0x227/0x1000
 [<c10079ea>] ? xen_force_evtchn_callback+0x1a/0x30
 [<c1667250>] schedule+0x30/0x50
 [<c166890d>] rwsem_down_failed_common+0x9d/0xf0
 [<c1668992>] rwsem_down_read_failed+0x12/0x14
 [<c1346b63>] call_rwsem_down_read_failed+0x7/0xc
 [<c166814d>] ? down_read+0xd/0x10
 [<c1086f9a>] acct_collect+0x3a/0x170
 [<c105028a>] do_exit+0x62a/0x7d0
 [<c104cb37>] ? kmsg_dump+0x37/0xc0
 [<c1669ac0>] oops_end+0x90/0xd0
 [<c1032dbe>] no_context+0xbe/0x190
 [<c1032f28>] __bad_area_nosemaphore+0x98/0x140
 [<c1008089>] ? xen_clocksource_read+0x19/0x20
 [<c10081f7>] ? xen_vcpuop_set_next_event+0x47/0x80
 [<c1032fe2>] bad_area_nosemaphore+0x12/0x20
 [<c166bc12>] do_page_fault+0x2d2/0x3f0
 [<c106e389>] ? hrtimer_interrupt+0x1a9/0x2b0
 [<c10079ea>] ? xen_force_evtchn_callback+0x1a/0x30
 [<c1008294>] ? check_events+0x8/0xc
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c1668a44>] ? _raw_spin_unlock_irqrestore+0x14/0x20
 [<c166b940>] ? spurious_fault+0x130/0x130
 [<c166932e>] error_code+0x5a/0x60
 [<c166b940>] ? spurious_fault+0x130/0x130
 [<c10ebc58>] ? __page_check_address+0xb8/0x170
 [<c10ec769>] try_to_unmap_one+0x29/0x310
 [<c10ecb73>] try_to_unmap_file+0x83/0x560
 [<c1005829>] ? xen_pte_val+0xb9/0x140
 [<c1004116>] ? __raw_callee_save_xen_pte_val+0x6/0x8
 [<c10e1bf8>] ? vm_normal_page+0x28/0xc0
 [<c1038e95>] ? kmap_atomic_prot+0x45/0x110
 [<c10ed13c>] try_to_munlock+0x1c/0x40
 [<c10e7109>] munlock_vma_page+0x49/0x90
 [<c10e7247>] munlock_vma_pages_range+0x57/0xa0
 [<c10e7352>] mlock_fixup+0xc2/0x130
 [<c10e742c>] do_mlockall+0x6c/0x80
 [<c10e7469>] sys_munlockall+0x29/0x50
 [<c166f1d8>] sysenter_do_call+0x12/0x28

Thanks,
-Chris
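For reference, a crash IP like "__page_check_address+0xb8/0x170" can be mapped back to a source line against the matching dom0 kernel image. A minimal sketch, assuming a vmlinux built with CONFIG_DEBUG_INFO is at hand (the ./vmlinux path is illustrative):

# Resolve the symbol+offset from the oops to a source line.
gdb -batch -ex 'list *(__page_check_address+0xb8)' ./vmlinux

# Alternatively, use the absolute EIP printed in the oops (c10ebc58):
addr2line -e ./vmlinux -f -i c10ebc58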
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE

This looks suspiciously like a fix that went in some time ago, ah:

2cd1c8d x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode

but that went in 3.2, so that can't be it.

Hm, can you give more details on what parameters you are passing to dom0 and the hypervisor so I can reproduce it? Also, could you send me your .config file?

Is the underlying storage SCSI? And is this only happening on these Supermicro boxes, or are you seeing this on other hardware as well?

> We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove
> loop deployed to a few dozen boxes will hit it quite reliably within
> a short time. This happens on both an older LVM userspace and the
> newest, and in production we have seen this hit on lvremove,
> lvrename, and lvdelete.
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done
>
> [oops trace trimmed]
This looks suspiciously like the problem described by Nai Xia in "mm: page_check_address bug fix and make it validate subpages in huge pages" <http://lkml.org/lkml/2011/3/28/196> - which never made it into mainline that I can detect.

We've rebuilt without CONFIG_HIGHPTE and are running our test again. We shall see.

-Chris
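A quick way to confirm which of these options a given dom0 kernel was actually built with - a sketch, assuming either CONFIG_IKCONFIG_PROC is enabled or the distro ships the config under /boot:

# Check the relevant config options for the running kernel.
zgrep -E 'CONFIG_HIGHPTE|CONFIG_HUGETLB_PAGE' /proc/config.gz \
  || grep -E 'CONFIG_HIGHPTE|CONFIG_HUGETLB_PAGE' "/boot/config-$(uname -r)"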
On 5/7/12 2:37 PM, Christopher S. Aker wrote:
> This looks suspiciously like the problem described by Nai Xia in "mm:
> page_check_address bug fix and make it validate subpages in huge pages"
> <http://lkml.org/lkml/2011/3/28/196> - which never made it into mainline
> that I can detect.
>
> We've rebuilt without CONFIG_HIGHPTE and are running our test again. We
> shall see.

No joy just disabling CONFIG_HIGHPTE. Triggered it on four boxes in no time. It bugs out exactly the same as before, except the second BUG ("scheduling while atomic") doesn't happen.

On 5/7/12 1:17 PM, Konrad Rzeszutek Wilk wrote:
> Hm, can you give more details on what parameters you are passing to
> dom0 and the hypervisor so I can reproduce it?

Xen and dom0 binaries, modules, and arguments are here: http://theshore.net/~caker/xen/BUGS/lvm/

This is atop hardware RAID -- we haven't tested on anything other than Supermicro motherboards.

Thanks,
-Chris
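Fanning the reproducer out across several boxes at once amounts to running the loop from the first mail over SSH on each one. A sketch, where the hostnames and root SSH access are assumptions:

#!/bin/bash
# Run the lvcreate/lvremove loop on each test box in parallel.
for host in box01 box02 box03; do
    ssh "root@$host" 'while true; do
        lvcreate -L 256M -n test1 vg1 && lvremove -f vg1/test1
    done' &
done
wait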
We've tried 3.3.5 along with disabling CONFIG_HUGETLB_PAGE (since, looking at the source, it was potentially in the code path) and we're still able to trigger the bug.

Is there anything else we can try, or more information we can provide, to help this along?

Thanks,
-Chris
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE
>
> We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove
> loop deployed to a few dozen boxes will hit it quite reliably within
> a short time. This happens on both an older LVM userspace and the
> newest, and in production we have seen this hit on lvremove,
> lvrename, and lvdelete.
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done

So I tried this with 3.4-rc6 and didn't see this. The machine isn't that powerful - just an Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, so four CPUs are visible.

Let me try with 3.2.x shortly.

> [oops trace trimmed]
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done

I just did this with 3.2.16 and didn't experience this. Can you try 3.2.16 pls?

I used the attached .config
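Reproducing that setup is a standard stable-kernel build against the supplied config. A sketch, where the download URL and the .config path are illustrative (and `make oldconfig` will only prompt for symbols missing from the supplied config):

wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.2.16.tar.xz
tar xf linux-3.2.16.tar.xz && cd linux-3.2.16
cp /path/to/attached.config .config   # the .config from this mail
make oldconfig
make -j"$(nproc)" bzImage modules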
On May 11, 2012, at 2:30 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
>> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
>
> I just did this with 3.2.16 and didn't experience this. Can you
> try 3.2.16 pls?
>
> I used the attached .config
> <.config.txt>

Thanks, but no joy with 3.2.16 and your config - the only changes made were to build in a few drivers versus as modules, to avoid an initrd. Two boxes out of a dozen hit it in under an hour.

[ 2152.285097] BUG: unable to handle kernel paging request at bffff3f0
[ 2152.285160] IP: [<c1137f4a>] __page_check_address+0xca/0x1b0
[ 2152.285238] *pdpt = 000000001b5ac027 *pde = 0000000000000000
[ 2152.285286] Oops: 0000 [#1] PREEMPT SMP
[ 2152.285338] Modules linked in: dm_snapshot xen_evtchn xenfs ext2 dm_mod tpm_tis ata_generic ata_piix e1000e sg
[ 2152.285468]
[ 2152.285495] Pid: 506, comm: lvremove Tainted: G W 3.2.16 #4 Supermicro X8DT6/X8DT6
[ 2152.285572] EIP: 0061:[<c1137f4a>] EFLAGS: 00010246 CPU: 14
[ 2152.285607] EIP is at __page_check_address+0xca/0x1b0
[ 2152.285641] EAX: bffff000 EBX: dbe79dd8 ECX: 00000000 EDX: e2c00000
[ 2152.285678] ESI: bffff3f0 EDI: e33538e0 EBP: daf03e58 ESP: daf03e48
[ 2152.285715] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[ 2152.285750] Process lvremove (pid: 506, ti=daf02000 task=dd18a3e0 task.ti=daf02000)
[ 2152.285804] Stack:
[ 2152.285830]  00000fff e33538e0 dd1a5400 da4826c0 daf03e8c c1138b69 daf03e7c 00000000
[ 2152.285938]  00000124 00000000 c158df53 b767e000 00000000 b7686000 da4826c0 b767e000
[ 2152.286042]  e33538e0 daf03f1c c11390e3 00000002 e2c00280 008c7025 daf03ebc c1037d59
[ 2152.286146] Call Trace:
[ 2152.286178]  [<c1138b69>] try_to_unmap_one+0x29/0x370
[ 2152.286218]  [<c158df53>] ? _raw_spin_unlock+0x13/0x40
[ 2152.286255]  [<c11390e3>] try_to_unmap_file+0x83/0x5a0
[ 2152.286295]  [<c1037d59>] ? xen_pte_val+0xb9/0x140
[ 2152.286332]  [<c1036c46>] ? __raw_callee_save_xen_pte_val+0x6/0x8
[ 2152.286372]  [<c112d6e8>] ? vm_normal_page+0x28/0xe0
[ 2152.286408]  [<c1037a5d>] ? xen_pmd_val+0x6d/0xf0
[ 2152.286446]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.286482]  [<c113972c>] try_to_munlock+0x1c/0x40
[ 2152.286518]  [<c11331d9>] munlock_vma_page+0x49/0x90
[ 2152.286582]  [<c113332d>] munlock_vma_pages_range+0x6d/0xb0
[ 2152.286620]  [<c1133437>] mlock_fixup+0xc7/0x130
[ 2152.286656]  [<c1133717>] do_mlock+0x97/0xc0
[ 2152.286690]  [<c1133782>] sys_munlock+0x42/0x60
[ 2152.286728]  [<c159465f>] sysenter_do_call+0x12/0x28
[ 2152.286761] Code: e0 c4 76 c1 8b 04 85 c0 c4 76 c1 2b 90 b0 12 00 00 c1 e2 05 03 90 ac 12 00 00 89 d0 e8 b0 9e f3 ff 8b 4d 0c 85 c9 8d 34 30 75 0c <f7> 06 01 01 00 00 0f 84 ab 00 00 00 8b 03 8b 53 04 ff 15 14 25
[ 2152.287263] EIP: [<c1137f4a>] __page_check_address+0xca/0x1b0 SS:ESP 0069:daf03e48
[ 2152.287331] CR2: 00000000bffff3f0
[ 2152.287560] ---[ end trace a7919e7f17c0a74d ]---
[ 2152.287622] note: lvremove[506] exited with preempt_count 1
[ 2152.287685] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[ 2152.287767] in_atomic(): 1, irqs_disabled(): 0, pid: 506, name: lvremove
[ 2152.287833] Pid: 506, comm: lvremove Tainted: G D W 3.2.16 #4
[ 2152.287897] Call Trace:
[ 2152.287983]  [<c1078244>] __might_sleep+0xe4/0x110
[ 2152.288047]  [<c158d337>] down_read+0x17/0x30
[ 2152.288112]  [<c10cb21a>] acct_collect+0x3a/0x160
[ 2152.288177]  [<c108a55a>] do_exit+0x65a/0x810
[ 2152.288240]  [<c1087878>] ? kmsg_dump+0x98/0xc0
[ 2152.288303]  [<c158f640>] oops_end+0x90/0xd0
[ 2152.288367]  [<c106b24e>] no_context+0xbe/0x190
[ 2152.288430]  [<c106b3b8>] __bad_area_nosemaphore+0x98/0x140
[ 2152.288496]  [<c103b9ec>] ? xen_clocksource_read+0x2c/0x60
[ 2152.288560]  [<c106b472>] bad_area_nosemaphore+0x12/0x20
[ 2152.288625]  [<c1591393>] do_page_fault+0x2b3/0x450
[ 2152.288690]  [<c10e5624>] ? handle_percpu_irq+0x34/0x50
[ 2152.288753]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.288819]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.288886]  [<c103bc2b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 2152.288953]  [<c10e842a>] ? rcu_enter_nohz+0x4a/0x80
[ 2152.289017]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.289083]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.289145]  [<c15915ab>] ? sub_preempt_count+0x7b/0xb0
[ 2152.289209]  [<c15910e0>] ? spurious_fault+0x130/0x130
[ 2152.289273]  [<c158ee9b>] error_code+0x67/0x6c
[ 2152.289365]  [<c1137f4a>] ? __page_check_address+0xca/0x1b0
[ 2152.289429]  [<c1138b69>] try_to_unmap_one+0x29/0x370
[ 2152.289494]  [<c158df53>] ? _raw_spin_unlock+0x13/0x40
[ 2152.289558]  [<c11390e3>] try_to_unmap_file+0x83/0x5a0
[ 2152.289623]  [<c1037d59>] ? xen_pte_val+0xb9/0x140
[ 2152.289686]  [<c1036c46>] ? __raw_callee_save_xen_pte_val+0x6/0x8
[ 2152.289752]  [<c112d6e8>] ? vm_normal_page+0x28/0xe0
[ 2152.289816]  [<c1037a5d>] ? xen_pmd_val+0x6d/0xf0
[ 2152.289879]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.289943]  [<c113972c>] try_to_munlock+0x1c/0x40
[ 2152.291288]  [<c11331d9>] munlock_vma_page+0x49/0x90
[ 2152.291352]  [<c113332d>] munlock_vma_pages_range+0x6d/0xb0
[ 2152.291417]  [<c1133437>] mlock_fixup+0xc7/0x130
[ 2152.291479]  [<c1133717>] do_mlock+0x97/0xc0
[ 2152.291542]  [<c1133782>] sys_munlock+0x42/0x60
[ 2152.291606]  [<c159465f>] sysenter_do_call+0x12/0x28
[ 2152.291759] Modules linked in: dm_snapshot xen_evtchn xenfs ext2 dm_mod tpm_tis ata_generic ata_piix e1000e sg
[ 2152.292366] Pid: 506, comm: lvremove Tainted: G D W 3.2.16 #4

-Chris
Same Xen, but now with a 64-bit dom0 and 64-bit userspace, we were able to trigger this across about 15 machines within 48 hours (which is an improvement):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
PGD 27248067 PUD 5390067 PMD 0
Oops: 0000 [#1] SMP
CPU 6
Modules linked in: ebtable_nat xen_gntdev e1000e
Pid: 3550, comm: lvremove Not tainted 3.3.6-1-x86_64 #1 Supermicro X8DT6/X8DT6
RIP: e030:[<ffffffff8134cbc2>]  [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
RSP: e02b:ffff880023219bc8  EFLAGS: 00010246
RAX: 0000000000800002 RBX: ffff88000fedae90 RCX: ffff880023219bd8
RDX: 0000000000800000 RSI: 0000000000000000 RDI: ffff8800270c51e0
RBP: ffff880023219bc8 R08: 0000000000000080 R09: ffff88000fedae90
R10: ffff8800273f1b40 R11: ffff880023219bd8 R12: 0000000000000081
R13: ffff88000fedae90 R14: ffff880025ad8009 R15: ffff880025ad8008
FS:  00007f5a9af837a0(0000) GS:ffff88003fd80000(0063) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000020 CR3: 000000000aa15000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lvremove (pid: 3550, threadinfo ffff880023218000, task ffff88000e371d40)
Stack:
 ffff880023219c68 ffffffff8134d109 0000000000000009 0000000000000000
 ffff88000fedae90 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8134d109>] selinux_inode_permission+0xa9/0x100
 [<ffffffff8134ad37>] security_inode_permission+0x17/0x20
 [<ffffffff8113244c>] inode_permission+0x3c/0xd0
 [<ffffffff81134b21>] link_path_walk+0x91/0x800
 [<ffffffff81135903>] path_lookupat+0x53/0x690
 [<ffffffff8134d01d>] ? path_has_perm+0x4d/0x50
 [<ffffffff81135f6c>] do_path_lookup+0x2c/0xc0
 [<ffffffff81136717>] user_path_parent+0x47/0x80
 [<ffffffff81136a0e>] do_unlinkat+0x2e/0x1d0
 [<ffffffff8112bd09>] ? vfs_lstat+0x19/0x20
 [<ffffffff810431fe>] ? sys32_lstat64+0x2e/0x40
 [<ffffffff81136bc1>] sys_unlink+0x11/0x20
 [<ffffffff81731416>] sysenter_dispatch+0x7/0x21
 [<ffffffff8100961d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81009de2>] ? check_events+0x12/0x20
Code: 00 e8 b3 44 dd ff c9 c3 48 81 ff ff 0f 00 00 77 e8 0f 0b eb fe 0f 1f 40 00 55 48 89 e5 f6 46 0d 02 75 23 48 8b 76 38 48 8b 7f 68 <0f> b7 46 20 45 89 c1 8b 76 1c 49 89 c8 8b 7f 04 89 d1 89 c2 e8
RIP  [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
 RSP <ffff880023219bc8>
CR2: 0000000000000020
---[ end trace 9f021822c5071694 ]---

A different trace, but curious that it was still triggered by LVM userspace. We've disabled SELinux and reset the test across about 25 machines.

-Chris
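For reference, disabling SELinux for a test run like this is done either at runtime or persistently; a sketch (both mechanisms are standard, pick whichever fits):

# Runtime: switch to permissive mode until the next reboot.
setenforce 0

# Persistent: disable across reboots via the config file.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Or boot the dom0 kernel with selinux=0 on its command line.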