Xen: 4.1.3-rc1-pre (xenbits @ 23285)
Dom0: 3.2.6 PAE and 3.3.4 PAE

We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove loop
deployed to a few dozen boxes will hit it quite reliably within a short time.
This happens on both an older LVM userspace and the newest, and in production
we have seen this hit on lvremove, lvrename, and lvdelete.

#!/bin/bash
while true; do
    lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
done

BUG: unable to handle kernel paging request at bffff628
IP: [<c10ebc58>] __page_check_address+0xb8/0x170
*pdpt = 0000000003cfb027 *pde = 0000000013873067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 27902, comm: lvremove Not tainted 3.2.6-1 #1 Supermicro X8DT6/X8DT6
EIP: 0061:[<c10ebc58>] EFLAGS: 00010246 CPU: 6
EIP is at __page_check_address+0xb8/0x170
EAX: bffff000 EBX: cbf76dd8 ECX: 00000000 EDX: 00000000
ESI: bffff628 EDI: e49ed900 EBP: c80ffe60 ESP: c80ffe4c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process lvremove (pid: 27902, ti=c80fe000 task=d29adca0 task.ti=c80fe000)
Stack:
 e4205000 00000fff da9b6bc0 d0068dc0 e49ed900 c80ffe94 c10ec769 c80ffe84
 00000000 00000129 00000125 b76c5000 00000001 00000000 d0068c08 d0068dc0
 b76c5000 e49ed900 c80fff24 c10ecb73 00000002 00000005 35448025 c80ffec4
Call Trace:
 [<c10ec769>] try_to_unmap_one+0x29/0x310
 [<c10ecb73>] try_to_unmap_file+0x83/0x560
 [<c1005829>] ? xen_pte_val+0xb9/0x140
 [<c1004116>] ? __raw_callee_save_xen_pte_val+0x6/0x8
 [<c10e1bf8>] ? vm_normal_page+0x28/0xc0
 [<c1038e95>] ? kmap_atomic_prot+0x45/0x110
 [<c10ed13c>] try_to_munlock+0x1c/0x40
 [<c10e7109>] munlock_vma_page+0x49/0x90
 [<c10e7247>] munlock_vma_pages_range+0x57/0xa0
 [<c10e7352>] mlock_fixup+0xc2/0x130
 [<c10e742c>] do_mlockall+0x6c/0x80
 [<c10e7469>] sys_munlockall+0x29/0x50
 [<c166f1d8>] sysenter_do_call+0x12/0x28
Code: ff c1 ee 09 81 e6 f8 0f 00 00 81 e1 ff 0f 00 00 0f ac ca 0c c1 e2 05 03 55 ec 89 d0 e8 12 d3 f4 ff 8b 4d 0c 85 c9 8d 34 30 75 0c <f7> 06 01 01 00 00 0f 84 84 00 00 00 8b 0d 00 0e 9b c1 89 4d f0
EIP: [<c10ebc58>] __page_check_address+0xb8/0x170 SS:ESP 0069:c80ffe4c
CR2: 00000000bffff628
---[ end trace 8039aeca9c19f5ab ]---
note: lvremove[27902] exited with preempt_count 1
BUG: scheduling while atomic: lvremove/27902/0x00000001
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 27902, comm: lvremove Tainted: G      D 3.2.6-1 #1
Call Trace:
 [<c1040fcd>] __schedule_bug+0x5d/0x70
 [<c1666fb9>] __schedule+0x679/0x830
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c10a05fc>] ? rcu_enter_nohz+0x3c/0x60
 [<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
 [<c1001227>] ? hypercall_page+0x227/0x1000
 [<c10079ea>] ? xen_force_evtchn_callback+0x1a/0x30
 [<c1667250>] schedule+0x30/0x50
 [<c166890d>] rwsem_down_failed_common+0x9d/0xf0
 [<c1668992>] rwsem_down_read_failed+0x12/0x14
 [<c1346b63>] call_rwsem_down_read_failed+0x7/0xc
 [<c166814d>] ? down_read+0xd/0x10
 [<c1086f9a>] acct_collect+0x3a/0x170
 [<c105028a>] do_exit+0x62a/0x7d0
 [<c104cb37>] ? kmsg_dump+0x37/0xc0
 [<c1669ac0>] oops_end+0x90/0xd0
 [<c1032dbe>] no_context+0xbe/0x190
 [<c1032f28>] __bad_area_nosemaphore+0x98/0x140
 [<c1008089>] ? xen_clocksource_read+0x19/0x20
 [<c10081f7>] ? xen_vcpuop_set_next_event+0x47/0x80
 [<c1032fe2>] bad_area_nosemaphore+0x12/0x20
 [<c166bc12>] do_page_fault+0x2d2/0x3f0
 [<c106e389>] ? hrtimer_interrupt+0x1a9/0x2b0
 [<c10079ea>] ? xen_force_evtchn_callback+0x1a/0x30
 [<c1008294>] ? check_events+0x8/0xc
 [<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c1668a44>] ? _raw_spin_unlock_irqrestore+0x14/0x20
 [<c166b940>] ? spurious_fault+0x130/0x130
 [<c166932e>] error_code+0x5a/0x60
 [<c166b940>] ? spurious_fault+0x130/0x130
 [<c10ebc58>] ? __page_check_address+0xb8/0x170
 [<c10ec769>] try_to_unmap_one+0x29/0x310
 [<c10ecb73>] try_to_unmap_file+0x83/0x560
 [<c1005829>] ? xen_pte_val+0xb9/0x140
 [<c1004116>] ? __raw_callee_save_xen_pte_val+0x6/0x8
 [<c10e1bf8>] ? vm_normal_page+0x28/0xc0
 [<c1038e95>] ? kmap_atomic_prot+0x45/0x110
 [<c10ed13c>] try_to_munlock+0x1c/0x40
 [<c10e7109>] munlock_vma_page+0x49/0x90
 [<c10e7247>] munlock_vma_pages_range+0x57/0xa0
 [<c10e7352>] mlock_fixup+0xc2/0x130
 [<c10e742c>] do_mlockall+0x6c/0x80
 [<c10e7469>] sys_munlockall+0x29/0x50
 [<c166f1d8>] sysenter_do_call+0x12/0x28

Thanks,
-Chris
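For reference, a crash IP like "__page_check_address+0xb8/0x170" can be mapped back to a source line against the matching dom0 kernel image. A minimal sketch, assuming a vmlinux built with CONFIG_DEBUG_INFO is at hand (the ./vmlinux path is illustrative):

# Resolve the symbol+offset from the oops to a source line.
gdb -batch -ex 'list *(__page_check_address+0xb8)' ./vmlinux

# Alternatively, use the absolute EIP printed in the oops (c10ebc58):
addr2line -e ./vmlinux -f -i c10ebc58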
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE

This looks suspiciously like a fix that went in some time ago, ah:

2cd1c8d x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode

but that went in 3.2, so that can't be it.

Hm, can you give more details on what parameters you are passing to dom0 and the hypervisor so I can reproduce it? Also, could you send me your .config file?

Is the underlying storage SCSI? And is this only happening on these Supermicro boxes, or are you seeing this on other hardware as well?

> We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove
> loop deployed to a few dozen boxes will hit it quite reliably within
> a short time. This happens on both an older LVM userspace and the
> newest, and in production we have seen this hit on lvremove,
> lvrename, and lvdelete.
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done
>
> [oops trace trimmed]
This looks suspiciously like the problem described by Nai Xia in "mm: page_check_address bug fix and make it validate subpages in huge pages" <http://lkml.org/lkml/2011/3/28/196> - which never made it into mainline that I can detect.

We've rebuilt without CONFIG_HIGHPTE and are running our test again. We shall see.

-Chris
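A quick way to confirm which of these options a given dom0 kernel was actually built with - a sketch, assuming either CONFIG_IKCONFIG_PROC is enabled or the distro ships the config under /boot:

# Check the relevant config options for the running kernel.
zgrep -E 'CONFIG_HIGHPTE|CONFIG_HUGETLB_PAGE' /proc/config.gz \
  || grep -E 'CONFIG_HIGHPTE|CONFIG_HUGETLB_PAGE' "/boot/config-$(uname -r)"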
On 5/7/12 2:37 PM, Christopher S. Aker wrote:
> This looks suspiciously like the problem described by Nai Xia in "mm:
> page_check_address bug fix and make it validate subpages in huge pages"
> <http://lkml.org/lkml/2011/3/28/196> - which never made it into mainline
> that I can detect.
>
> We've rebuilt without CONFIG_HIGHPTE and are running our test again. We
> shall see.

No joy just disabling CONFIG_HIGHPTE. Triggered it on four boxes in no time. It bugs out exactly the same as before, except the second BUG ("scheduling while atomic") doesn't happen.

On 5/7/12 1:17 PM, Konrad Rzeszutek Wilk wrote:
> Hm, can you give more details on what parameters you are passing to
> dom0 and the hypervisor so I can reproduce it?

Xen and dom0 binaries, modules, and arguments are here: http://theshore.net/~caker/xen/BUGS/lvm/

This is atop hardware RAID -- we haven't tested on anything other than Supermicro motherboards.

Thanks,
-Chris
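Fanning the reproducer out across several boxes at once amounts to running the loop from the first mail over SSH on each one. A sketch, where the hostnames and root SSH access are assumptions:

#!/bin/bash
# Run the lvcreate/lvremove loop on each test box in parallel.
for host in box01 box02 box03; do
    ssh "root@$host" 'while true; do
        lvcreate -L 256M -n test1 vg1 && lvremove -f vg1/test1
    done' &
done
wait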
We've tried 3.3.5 along with disabling CONFIG_HUGETLB_PAGE (since, looking at the source, it was potentially in the code path) and we're still able to trigger the bug.

Is there anything else we can try, or more information we can provide, to help this along?

Thanks,
-Chris
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE
>
> We're seeing the below crash on 3.x dom0s. A simple lvcreate/lvremove
> loop deployed to a few dozen boxes will hit it quite reliably within
> a short time. This happens on both an older LVM userspace and the
> newest, and in production we have seen this hit on lvremove,
> lvrename, and lvdelete.
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done

So I tried this with 3.4-rc6 and didn't see this. The machine isn't that powerful - just an Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, so four CPUs are visible.

Let me try with 3.2.x shortly.

> [oops trace trimmed]
On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
> Dom0: 3.2.6 PAE and 3.3.4 PAE
>
> #!/bin/bash
> while true; do
>     lvcreate -L 256M -n test1 vg1; lvremove -f vg1/test1
> done

I just did this with 3.2.16 and didn't experience this. Can you try 3.2.16 pls?

I used the attached .config
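Reproducing that setup is a standard stable-kernel build against the supplied config. A sketch, where the download URL and the .config path are illustrative (and `make oldconfig` will only prompt for symbols missing from the supplied config):

wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.2.16.tar.xz
tar xf linux-3.2.16.tar.xz && cd linux-3.2.16
cp /path/to/attached.config .config   # the .config from this mail
make oldconfig
make -j"$(nproc)" bzImage modules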
On May 11, 2012, at 2:30 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, May 07, 2012 at 11:36:22AM -0400, Christopher S. Aker wrote:
>> Xen: 4.1.3-rc1-pre (xenbits @ 23285)
>
> I just did this with 3.2.16 and didn't experience this. Can you
> try 3.2.16 pls?
>
> I used the attached .config
> <.config.txt>

Thanks, but no joy with 3.2.16 and your config - the only changes made were to build in a few drivers versus as modules, to avoid an initrd. Two boxes out of a dozen hit it in under an hour.

[ 2152.285097] BUG: unable to handle kernel paging request at bffff3f0
[ 2152.285160] IP: [<c1137f4a>] __page_check_address+0xca/0x1b0
[ 2152.285238] *pdpt = 000000001b5ac027 *pde = 0000000000000000
[ 2152.285286] Oops: 0000 [#1] PREEMPT SMP
[ 2152.285338] Modules linked in: dm_snapshot xen_evtchn xenfs ext2 dm_mod tpm_tis ata_generic ata_piix e1000e sg
[ 2152.285468]
[ 2152.285495] Pid: 506, comm: lvremove Tainted: G W 3.2.16 #4 Supermicro X8DT6/X8DT6
[ 2152.285572] EIP: 0061:[<c1137f4a>] EFLAGS: 00010246 CPU: 14
[ 2152.285607] EIP is at __page_check_address+0xca/0x1b0
[ 2152.285641] EAX: bffff000 EBX: dbe79dd8 ECX: 00000000 EDX: e2c00000
[ 2152.285678] ESI: bffff3f0 EDI: e33538e0 EBP: daf03e58 ESP: daf03e48
[ 2152.285715] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[ 2152.285750] Process lvremove (pid: 506, ti=daf02000 task=dd18a3e0 task.ti=daf02000)
[ 2152.285804] Stack:
[ 2152.285830]  00000fff e33538e0 dd1a5400 da4826c0 daf03e8c c1138b69 daf03e7c 00000000
[ 2152.285938]  00000124 00000000 c158df53 b767e000 00000000 b7686000 da4826c0 b767e000
[ 2152.286042]  e33538e0 daf03f1c c11390e3 00000002 e2c00280 008c7025 daf03ebc c1037d59
[ 2152.286146] Call Trace:
[ 2152.286178]  [<c1138b69>] try_to_unmap_one+0x29/0x370
[ 2152.286218]  [<c158df53>] ? _raw_spin_unlock+0x13/0x40
[ 2152.286255]  [<c11390e3>] try_to_unmap_file+0x83/0x5a0
[ 2152.286295]  [<c1037d59>] ? xen_pte_val+0xb9/0x140
[ 2152.286332]  [<c1036c46>] ? __raw_callee_save_xen_pte_val+0x6/0x8
[ 2152.286372]  [<c112d6e8>] ? vm_normal_page+0x28/0xe0
[ 2152.286408]  [<c1037a5d>] ? xen_pmd_val+0x6d/0xf0
[ 2152.286446]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.286482]  [<c113972c>] try_to_munlock+0x1c/0x40
[ 2152.286518]  [<c11331d9>] munlock_vma_page+0x49/0x90
[ 2152.286582]  [<c113332d>] munlock_vma_pages_range+0x6d/0xb0
[ 2152.286620]  [<c1133437>] mlock_fixup+0xc7/0x130
[ 2152.286656]  [<c1133717>] do_mlock+0x97/0xc0
[ 2152.286690]  [<c1133782>] sys_munlock+0x42/0x60
[ 2152.286728]  [<c159465f>] sysenter_do_call+0x12/0x28
[ 2152.286761] Code: e0 c4 76 c1 8b 04 85 c0 c4 76 c1 2b 90 b0 12 00 00 c1 e2 05 03 90 ac 12 00 00 89 d0 e8 b0 9e f3 ff 8b 4d 0c 85 c9 8d 34 30 75 0c <f7> 06 01 01 00 00 0f 84 ab 00 00 00 8b 03 8b 53 04 ff 15 14 25
[ 2152.287263] EIP: [<c1137f4a>] __page_check_address+0xca/0x1b0 SS:ESP 0069:daf03e48
[ 2152.287331] CR2: 00000000bffff3f0
[ 2152.287560] ---[ end trace a7919e7f17c0a74d ]---
[ 2152.287622] note: lvremove[506] exited with preempt_count 1
[ 2152.287685] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[ 2152.287767] in_atomic(): 1, irqs_disabled(): 0, pid: 506, name: lvremove
[ 2152.287833] Pid: 506, comm: lvremove Tainted: G D W 3.2.16 #4
[ 2152.287897] Call Trace:
[ 2152.287983]  [<c1078244>] __might_sleep+0xe4/0x110
[ 2152.288047]  [<c158d337>] down_read+0x17/0x30
[ 2152.288112]  [<c10cb21a>] acct_collect+0x3a/0x160
[ 2152.288177]  [<c108a55a>] do_exit+0x65a/0x810
[ 2152.288240]  [<c1087878>] ? kmsg_dump+0x98/0xc0
[ 2152.288303]  [<c158f640>] oops_end+0x90/0xd0
[ 2152.288367]  [<c106b24e>] no_context+0xbe/0x190
[ 2152.288430]  [<c106b3b8>] __bad_area_nosemaphore+0x98/0x140
[ 2152.288496]  [<c103b9ec>] ? xen_clocksource_read+0x2c/0x60
[ 2152.288560]  [<c106b472>] bad_area_nosemaphore+0x12/0x20
[ 2152.288625]  [<c1591393>] do_page_fault+0x2b3/0x450
[ 2152.288690]  [<c10e5624>] ? handle_percpu_irq+0x34/0x50
[ 2152.288753]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.288819]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.288886]  [<c103bc2b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 2152.288953]  [<c10e842a>] ? rcu_enter_nohz+0x4a/0x80
[ 2152.289017]  [<c103b24a>] ? xen_force_evtchn_callback+0x1a/0x30
[ 2152.289083]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.289145]  [<c15915ab>] ? sub_preempt_count+0x7b/0xb0
[ 2152.289209]  [<c15910e0>] ? spurious_fault+0x130/0x130
[ 2152.289273]  [<c158ee9b>] error_code+0x67/0x6c
[ 2152.289365]  [<c1137f4a>] ? __page_check_address+0xca/0x1b0
[ 2152.289429]  [<c1138b69>] try_to_unmap_one+0x29/0x370
[ 2152.289494]  [<c158df53>] ? _raw_spin_unlock+0x13/0x40
[ 2152.289558]  [<c11390e3>] try_to_unmap_file+0x83/0x5a0
[ 2152.289623]  [<c1037d59>] ? xen_pte_val+0xb9/0x140
[ 2152.289686]  [<c1036c46>] ? __raw_callee_save_xen_pte_val+0x6/0x8
[ 2152.289752]  [<c112d6e8>] ? vm_normal_page+0x28/0xe0
[ 2152.289816]  [<c1037a5d>] ? xen_pmd_val+0x6d/0xf0
[ 2152.289879]  [<c107995b>] ? get_parent_ip+0xb/0x40
[ 2152.289943]  [<c113972c>] try_to_munlock+0x1c/0x40
[ 2152.291288]  [<c11331d9>] munlock_vma_page+0x49/0x90
[ 2152.291352]  [<c113332d>] munlock_vma_pages_range+0x6d/0xb0
[ 2152.291417]  [<c1133437>] mlock_fixup+0xc7/0x130
[ 2152.291479]  [<c1133717>] do_mlock+0x97/0xc0
[ 2152.291542]  [<c1133782>] sys_munlock+0x42/0x60
[ 2152.291606]  [<c159465f>] sysenter_do_call+0x12/0x28
[ 2152.291759] Modules linked in: dm_snapshot xen_evtchn xenfs ext2 dm_mod tpm_tis ata_generic ata_piix e1000e sg
[ 2152.292366] Pid: 506, comm: lvremove Tainted: G D W 3.2.16 #4

-Chris
Same Xen, but now with a 64-bit dom0 and 64-bit userspace, we were able to trigger this across about 15 machines within 48 hours (which is an improvement):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
PGD 27248067 PUD 5390067 PMD 0
Oops: 0000 [#1] SMP
CPU 6
Modules linked in: ebtable_nat xen_gntdev e1000e
Pid: 3550, comm: lvremove Not tainted 3.3.6-1-x86_64 #1 Supermicro X8DT6/X8DT6
RIP: e030:[<ffffffff8134cbc2>]  [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
RSP: e02b:ffff880023219bc8  EFLAGS: 00010246
RAX: 0000000000800002 RBX: ffff88000fedae90 RCX: ffff880023219bd8
RDX: 0000000000800000 RSI: 0000000000000000 RDI: ffff8800270c51e0
RBP: ffff880023219bc8 R08: 0000000000000080 R09: ffff88000fedae90
R10: ffff8800273f1b40 R11: ffff880023219bd8 R12: 0000000000000081
R13: ffff88000fedae90 R14: ffff880025ad8009 R15: ffff880025ad8008
FS:  00007f5a9af837a0(0000) GS:ffff88003fd80000(0063) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000020 CR3: 000000000aa15000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lvremove (pid: 3550, threadinfo ffff880023218000, task ffff88000e371d40)
Stack:
 ffff880023219c68 ffffffff8134d109 0000000000000009 0000000000000000
 ffff88000fedae90 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8134d109>] selinux_inode_permission+0xa9/0x100
 [<ffffffff8134ad37>] security_inode_permission+0x17/0x20
 [<ffffffff8113244c>] inode_permission+0x3c/0xd0
 [<ffffffff81134b21>] link_path_walk+0x91/0x800
 [<ffffffff81135903>] path_lookupat+0x53/0x690
 [<ffffffff8134d01d>] ? path_has_perm+0x4d/0x50
 [<ffffffff81135f6c>] do_path_lookup+0x2c/0xc0
 [<ffffffff81136717>] user_path_parent+0x47/0x80
 [<ffffffff81136a0e>] do_unlinkat+0x2e/0x1d0
 [<ffffffff8112bd09>] ? vfs_lstat+0x19/0x20
 [<ffffffff810431fe>] ? sys32_lstat64+0x2e/0x40
 [<ffffffff81136bc1>] sys_unlink+0x11/0x20
 [<ffffffff81731416>] sysenter_dispatch+0x7/0x21
 [<ffffffff8100961d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81009de2>] ? check_events+0x12/0x20
Code: 00 e8 b3 44 dd ff c9 c3 48 81 ff ff 0f 00 00 77 e8 0f 0b eb fe 0f 1f 40 00 55 48 89 e5 f6 46 0d 02 75 23 48 8b 76 38 48 8b 7f 68 <0f> b7 46 20 45 89 c1 8b 76 1c 49 89 c8 8b 7f 04 89 d1 89 c2 e8
RIP  [<ffffffff8134cbc2>] inode_has_perm+0x12/0x40
 RSP <ffff880023219bc8>
CR2: 0000000000000020
---[ end trace 9f021822c5071694 ]---

A different trace, but curious that it was still triggered by LVM userspace. We've disabled SELinux and reset the test across about 25 machines.

-Chris
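For reference, disabling SELinux for a test run like this is done either at runtime or persistently; a sketch (both mechanisms are standard, pick whichever fits):

# Runtime: switch to permissive mode until the next reboot.
setenforce 0

# Persistent: disable across reboots via the config file.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Or boot the dom0 kernel with selinux=0 on its command line.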